Open In Colab

Data Disovery Phase

Prepared By: Simranjeet Randhawa

Data Discovery Phase

Drafting the business problem statement. A unique quality about Yellow cabs is that they can pick up passengers from the streets without any pre-arranged booking required. Whereas in the case of Uber, passengers mainly use the app to arrange for the taxi. So I found a business opportunity for yellow cabs to get back on track by targeting the streets where the pickups demand are high. If the passenger waits too much they switch to apps to look for a taxi but if the passengers can see a yellow cab right in front of them, they will surely choose the yellow cab instead of waiting for Uber. Hence this data science projects focus to help the NYC TLC yellow taxis to get back on track and increase the number of pickups around the NYC city. Also, the yellow cab taxi drivers can make use of the website to see the forecasted demand for different zones in NYC city.

Assess resource needs and availability: Resources Needed: Yellow Taxi pickups of past years and weather data of past year day by day was needed to make any projection in the future. Also, I need the taxi operating zone dataset of NYC city. Resources Availability: Yellow taxi dataset is a publicly available dataset. The attached datasets were provided and collected by the technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs for the NYC Taxi and Limousine Commission (TLC).

1. Data Gathering

Data Gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all data-related problems. In this step, we need to identify the different data sources, as data can be collected from various sources such as files, database, internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output. The more will be the data, the more accurate will be the prediction. This step includes the below tasks:

  • Identify various data sources
  • Collect data
  • Integrate the data obtained from different sources
By performing the above task, we get a coherent set of data, also called as a dataset. It will be used in further steps.

2. Identifying Data Sources

  • Weather Data:

    This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and present (updated daily), collected from over 9000 stations.

  • Zone Data:

    It consists of an Shape file that is converted into a CSV file using an online converter and then processed using pandas. It consist of 263 zones in NYC city where the yellow taxi operates.

  • Yellow Taxi Trip Data:

    The NYC yellow taxi data set is avaiable openly on NYC government site.The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs.

3. Data Collection

The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. It is avaiable as s3 amazon data bucket which was initially difficult to read using a pandas dataframe. Hence the requirement of Pyspark was felt that eased the task of collecting the data from s3 bucket and use the concept of RDD to help process large set of data.

3.1 Data Collection using Pandas

In [0]:
import pandas as pd
df1=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv")
df2=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv")
df3=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv")
df4=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv")
df5=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv")
df6=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv")
df7=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-07.csv")
df8=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-08.csv")
df9=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-09.csv")
df10=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv")
df11=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv")
df12=pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-12.csv")
In [0]:
frames=[df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12]
result = pd.concat(frames)

It was giving Not enough memory error. I decided to go with Pyspark which uses the RDD architecture which consumes less RAM. Also I setup the environment for RDD and also upgraded my RAM TO 25 GB with a standalone cluster on Google Colab.

3.2 Data Collection using Pyspark

2018 Taxi Data Collection

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
In [0]:
# 2. Setup Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
In [0]:
import findspark
findspark.init()
In [0]:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import col, date_sub, log, mean, to_date, udf, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql import DataFrame
In [0]:
sc =SparkContext()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
In [0]:
from pyspark import SparkFiles
url1 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-01.csv"
url2 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-02.csv"
url3 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-03.csv"
url4 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-04.csv"
url5 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-05.csv"
url6 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv"
url7 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-07.csv"
url8 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-08.csv"
url9 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-09.csv"
url10 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-10.csv"
url11 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-11.csv"
url12 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-12.csv"
spark.sparkContext.addFile(url1)
spark.sparkContext.addFile(url2)
spark.sparkContext.addFile(url3)
spark.sparkContext.addFile(url4)
spark.sparkContext.addFile(url5)
spark.sparkContext.addFile(url6)
spark.sparkContext.addFile(url7)
spark.sparkContext.addFile(url8)
spark.sparkContext.addFile(url9)
spark.sparkContext.addFile(url10)
spark.sparkContext.addFile(url11)
spark.sparkContext.addFile(url12)
In [0]:
df1 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-01.csv"), header=True, inferSchema= True)
df2 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-02.csv"), header=True, inferSchema= True)
df3 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-03.csv"), header=True, inferSchema= True)
df4 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-04.csv"), header=True, inferSchema= True)
df5 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-05.csv"), header=True, inferSchema= True)
df6 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-06.csv"), header=True, inferSchema= True)
df7 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-07.csv"), header=True, inferSchema= True)
df8 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-08.csv"), header=True, inferSchema= True)
df9 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-09.csv"), header=True, inferSchema= True)
df10 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-10.csv"), header=True, inferSchema= True)
df11 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-11.csv"), header=True, inferSchema= True)
df12 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2018-12.csv"), header=True, inferSchema= True)
In [0]:
df_lowerhalf = df1.union(df2)
df_lowerhalf = df_lowerhalf.union(df3)
df_lowerhalf = df_lowerhalf.union(df4)
df_lowerhalf = df_lowerhalf.union(df5)
df_lowerhalf = df_lowerhalf.union(df6)
In [0]:
df_upperhalf = d7.union(df8)
df_upperhalf = df_upperhalf.union(df9)
df_upperhalf = df_upperhalf.union(df10)
df_upperhalf = df_upperhalf.union(df11)
df_upperhalf = df_upperhalf.union(df12)

Copressing and storing in Parquet form to enable faster processing

In [0]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [0]:
print("Shape of 2018 First Half")
print((df_lowerhalf.count(), len(df_lowerhalf.columns)))
row2018_1=df_lowerhalf.count()
col2018_1=df_lowerhalf.columns
Shape of 2018 First Half
(53925735, 17)
In [0]:
print("Shape of 2018 Second Half")
print((df_upperhalf.count(), len(df_upperhalf.columns)))
row2018_2=df_upperhalf.count()
col2018_2=df_upperhalf.columns
Shape of 2018 Second Half
(48878515, 17)
In [0]:
#Creating Parquet files
df_upperhalf.write.parquet("/content/drive/My Drive/Data Science/2018secondhalf.parquet")
df_lowerhalf.write.parquet("/content/drive/My Drive/Data Science/2018firsthalf.parquet")

Now Parquet files can be read into Pandas Dataframe. I did run this command on 25 Gb colab cluster. Please note it is important to specify the file path to parquet file correctly.

In [0]:
import pandas as pd
y2018_part1=pd.read_parquet('/content/drive/My Drive/Data Science/2018firsthalf.parquet')
In [3]:
y2018_part1.shape
Out[3]:
(53925735, 17)
In [0]:
!pip install datalab
In [0]:
from datalab.context import Context

For running this steps you require Google Cloud Credentials with project ID. Note : I am not sharing my credentials as it is associated with my account. But you can run this step after setting up the credentials as shown in :

In [0]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/content/drive/My Drive/Data Science/nyc-taxi-265120-baf9a3e4cf9b.json"
%reload_ext google.cloud.bigquery
In [0]:
#Storing to Google Cloud Platform
project_id = 'nyc-taxi-265120' #@param{type:"string"}
y2018_part1.to_gbq('NYC.2018firsthalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
In [0]:
y2018_part2=pd.read_parquet('/content/drive/My Drive/Data Science/2018secondhalf.parquet')
In [0]:
#Storing to Google Cloud Platform
project_id = 'nyc-taxi-265120' #@param{type:"string"}
y2018_part2.to_gbq('NYC.2018SecondHalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )

2019 Taxi Data Collection

In [0]:
url1 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv"
url2 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv"
url3 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv"
url4 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv"
url5 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv"
url6 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv"

url7 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-07.csv"
url8 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-08.csv"
url9 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-09.csv"
url10 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv"
url11 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv"
url12 = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-12.csv"

spark.sparkContext.addFile(url1)
spark.sparkContext.addFile(url2)
spark.sparkContext.addFile(url3)
spark.sparkContext.addFile(url4)
spark.sparkContext.addFile(url5)
spark.sparkContext.addFile(url6)
spark.sparkContext.addFile(url7)
spark.sparkContext.addFile(url8)
spark.sparkContext.addFile(url9)
spark.sparkContext.addFile(url10)
spark.sparkContext.addFile(url11)
spark.sparkContext.addFile(url12)
In [0]:
df21 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-01.csv"), header=True, inferSchema= True)
df22 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-02.csv"), header=True, inferSchema= True)
df23 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-03.csv"), header=True, inferSchema= True)
df24 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-04.csv"), header=True, inferSchema= True)
df25 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-05.csv"), header=True, inferSchema= True)
df26 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-06.csv"), header=True, inferSchema= True)

df27 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-07.csv"), header=True, inferSchema= True)
df28 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-08.csv"), header=True, inferSchema= True)
df29 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-09.csv"), header=True, inferSchema= True)
df210 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-10.csv"), header=True, inferSchema= True)
df211 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-11.csv"), header=True, inferSchema= True)
df212 = spark.read.csv("file://"+SparkFiles.get("yellow_tripdata_2019-12.csv"), header=True, inferSchema= True)
In [0]:
df_lowerhalf_2019 = df21.union(df22)
df_lowerhalf_2019 = df_lowerhalf_2020.union(df23)
df_lowerhalf_2019 = df_lowerhalf_2020.union(df24)
df_lowerhalf_2019 = df_lowerhalf_2020.union(df25)
df_lowerhalf_2019 = df_lowerhalf_2020.union(df26)
In [0]:
df_upperhalf_2019 = df27.union(df28)
df_upperhalf_2019 = df_upperhalf_2020.union(df29)
df_upperhalf_2019 = df_upperhalf_2020.union(df210)
df_upperhalf_2019 = df_upperhalf_2020.union(df211)
df_upperhalf_2019 = df_upperhalf_2020.union(df212)
In [0]:
print("Shape of 2019 First Half")
print((df_lowerhalf_2019.count(), len(df_lowerhalf_2019.columns)))
row2019_1=df_lowerhalf_2019.count()
col2019_1=df_lowerhalf_2019.columns
Shape of 2019 First Half
(44459136, 17)
In [0]:
print("Shape of 2019 Second Half")
print((df_lowerhalf.count(), len(df_lowerhalf.columns)))
row2019_2=df_upperhalf_2019.count()
col2019_2=df_upperhalf_2019.columns
Shape of 2019 Second Half
(39939883, 17)
In [0]:
df_upperhalf_2020.write.parquet("/content/drive/My Drive/Data Science/2019upper.parquet")
df_lowerhalf_2020.write.parquet("/content/drive/My Drive/Data Science/2019lower.parquet")
In [0]:
import pandas as pd
y2019upper=pd.read_parquet('/content/drive/My Drive/Data Science/2019upper.parquet')
y2019lower=pd.read_parquet('/content/drive/My Drive/Data Science/2019lower.parquet')
In [0]:
#Storing to Google Cloud Platform
project_id = 'nyc-taxi-265120' #@param{type:"string"}
y2019lower.to_gbq('NYC.2019firsthalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
In [0]:
#Storing to Google Cloud Platform
project_id = 'nyc-taxi-265120' #@param{type:"string"}
y2019upper.to_gbq('NYC.2019secondhalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )

Weather Data

In [0]:
project_id = 'nyc-taxi-265120' 
from google.cloud import bigquery
client = bigquery.Client(project = project_id)
In [0]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

sample_count = 2000
row_count = client.query('''
  SELECT 
    COUNT(*) as total
  FROM `bigquery-public-data.samples.gsod`''').to_dataframe().total[0]

df = client.query('''
  SELECT
    *
  FROM
    `bigquery-public-data.samples.gsod`
  WHERE RAND() < %d/%d
''' % (sample_count, row_count)).to_dataframe()

print('Full dataset has %d rows' % row_count)
Full dataset has 114420316 rows
In [0]:
df.head()
Out[0]:
station_number wban_number year month day mean_temp num_mean_temp_samples mean_dew_point num_mean_dew_point_samples mean_sealevel_pressure num_mean_sealevel_pressure_samples mean_station_pressure num_mean_station_pressure_samples mean_visibility num_mean_visibility_samples mean_wind_speed num_mean_wind_speed_samples max_sustained_wind_speed max_gust_wind_speed max_temperature max_temperature_explicit min_temperature min_temperature_explicit total_precipitation snow_depth fog rain snow hail thunder tornado
0 153490 99999 1985 9 26 57.599998 4 NaN NaN NaN NaN NaN NaN 6.7 4.0 3.4 4.0 11.7 NaN 50.400002 True None None 0.08 NaN False False False False False False
1 722025 12849 1994 8 29 80.199997 21 73.800003 21.0 NaN NaN NaN NaN 12.6 21.0 3.0 21.0 9.9 NaN 74.800003 True None None NaN NaN False False False False False False
2 715520 99999 1999 9 2 60.000000 4 49.099998 4.0 NaN NaN NaN NaN NaN NaN 9.3 4.0 15.0 25.1 56.099998 True None None 0.00 NaN False False False False False False
3 333012 99999 2004 7 14 67.300003 15 60.299999 15.0 NaN NaN NaN NaN 7.0 15.0 8.9 15.0 11.7 19.4 62.599998 True None None NaN NaN False False False False False False
4 726502 54832 2009 5 31 54.099998 24 34.700001 24.0 NaN NaN NaN NaN 10.0 24.0 3.6 24.0 11.1 18.1 37.400002 True None None 0.00 NaN False False False False False False

Taxi Zone Data

In [0]:
SELECT COUNT(*) FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` LIMIT 1000
In [0]:
df = client.query('''SELECT * FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` ''' ).to_dataframe()
In [0]:
df.head()
Out[0]:
zone_id zone_name borough zone_geom
0 1 Newark Airport EWR POLYGON((-74.1856319999999 40.6916479999999, -...
1 3 Allerton/Pelham Gardens Bronx POLYGON((-73.848596761 40.8716707849999, -73.8...
2 18 Bedford Park Bronx POLYGON((-73.8844286139999 40.8668003789999, -...
3 20 Belmont Bronx POLYGON((-73.8839239579998 40.8644177609999, -...
4 31 Bronx Park Bronx POLYGON((-73.8710017319999 40.8572767429999, -...

4 Integrating all Data on One Platform

Google Cloud Platform : Big Query helps to explore such large datasets easily.

In [0]:
import pandas as pd
import numpy as np
import sys
from google.cloud import bigquery
from google.colab import auth
auth.authenticate_user()
In [0]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

dataset = ['2018 1st half', '2018 2nd half', '2019 first half', '2019 second half']
rows = [row2018_1,row2018_2,row2019_1,row2019_2]
plt.xlabel('Columns')
plt.ylabel('Rows')
ax.bar(dataset,rows)
plt.show()
In [0]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
dataset = ['2018 1st half', '2018 2nd half', '2019 first half', '2019 second half']
columns = [col2018_1,col2018_1,col2018_1,col2018_1]
plt.xlabel('Columns')
plt.ylabel('Dataset')
ax.bar(dataset,columns)
plt.show()

All DATA SET ON GOOGLE CLOUD

After using Pyspark and pandas togbq function I uploaded my entire dataset on Google Cloud Platform.The screenshot below depicts the entire Dataset information that I used in my Project:

  1. Schema For Taxi Data set in Google Cloud:

  2. Table Information for Yellow Taxi Data

  3. Table & Schema Information for Zones in NYC city

  4. Table Inforamtion

    Schema Inforamtion
  5. Table Information for Weather Data

Understanding Discovery: Discovery is an information-gathering process meant to dig deep into the details of what is important to a client's business, target audience, and industry. The more information you gather, interpret, and comprehend, the more prepared you will be to execute a site on budget and on target. This step was essential to determine the following :

  • Drafting a business problem statement.
  • Reframing the problem as an analytics
  • challenge.
  • Assess resource needs and availability.
  • Draft an analytic plan.
Scope and depth of research and inquiry will differ from project to project, but the results are the same: valuable data.

temp-158716187257738393

Open In Colab

Data Preparation

Prepared By: Simranjeet Randhawa

In [0]:
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
In [0]:
!pip install datalab
In [0]:
from datalab.context import Context

Establishing an Analytic Sandbox


I made use of the Google Cloud platform that provides with the BigQuery sandbox. The BigQuery sandbox gives you free access to the power of BigQuery subject to the sandbox's limits. The sandbox allows you to use the web UI in the Cloud Console without providing a credit card. You can use the sandbox without creating a billing account or enabling billing for your project. Big Query provides the ability to deal with big data using SQL like queries. In Hadoop, I used to use Hive to get back the result. So I decided to use Big Query in this project which gives the ability of ETL.

In [0]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/content/drive/My Drive/Data Science/nyc-taxi-265120-baf9a3e4cf9b.json"
%reload_ext google.cloud.bigquery
In [0]:
project_id = 'nyc-taxi-265120' 
from google.cloud import bigquery
client = bigquery.Client(project = project_id)
In [0]:
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

1. Data Preparation

Data preparation is termed as the process of cleaning the data so that the data can be transformed in a form that can be processed and analysed. It is an important step prior to processing and often involves reformatting data, making corrections to data and the combining of data sets to enrich data.

1.1 Data Cleaning

2018 First Half Yellow Cab Data Cleaning (Big Query)

  1. Analyse if there is rows with total trip amount =0 ?
  2. Is there rows with pick up time > drop off time.
  3. Analyse rows which have 0 passenger count.

In [0]:
# if numpy is not installed already : pip3 install numpy
import numpy as np
# matplotlib: used to plot graphs
import matplotlib
matplotlib.use('nbagg')
import matplotlib.pylab as plt
import seaborn as sns#Plots
from matplotlib import rcParams

Finding the number of trips where amount given by passenger is less than zero.

In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2018firsthalf` where total_amount < 0 
"""
df_head = client.query(query).to_dataframe()
In [0]:
df_head.head()
Out[0]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance PULocationID DOLocationID fare_amount
0 2018-01-11 09:55:06+00:00 2018-01-11 09:55:36+00:00 2 0.12 42 42 -2.5
1 2018-01-11 14:15:16+00:00 2018-01-11 14:15:30+00:00 2 0.06 24 43 -2.5
2 2018-01-12 10:07:42+00:00 2018-01-12 10:08:39+00:00 3 0.14 33 33 -2.5
3 2018-01-12 14:04:39+00:00 2018-01-12 14:05:09+00:00 1 0.06 260 260 -2.5
4 2018-01-12 15:00:44+00:00 2018-01-12 15:01:00+00:00 1 0.00 261 261 -2.5
In [0]:
print("Number of rows with amount of trip <0:")
print(df_head.shape[0])
Number of rows with amount of trip <0:
29860
In [0]:
query = """
SELECT * FROM `nyc-taxi-265120.NYC.2018firsthalf` LIMIT 10000
"""
df_head = client.query(query).to_dataframe()
In [0]:
df_head.describe()
Out[0]:
VendorID passenger_count trip_distance RatecodeID PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.504900 1.489400 0.388537 1.037200 168.243000 170.657800 1.702600 3.090250 0.416830 0.482600 0.519640 0.036067 0.291120 4.837872
std 0.500001 1.182044 1.429983 1.034274 69.474621 69.068742 0.647762 0.807586 0.413448 0.118948 2.876293 0.885342 0.068232 3.204282
min 1.000000 0.000000 0.000000 1.000000 4.000000 4.000000 1.000000 -3.000000 -1.000000 -0.500000 -3.000000 0.000000 -0.300000 -6.300000
25% 1.000000 1.000000 0.130000 1.000000 125.000000 132.000000 1.000000 3.000000 0.000000 0.500000 0.000000 0.000000 0.300000 4.300000
50% 2.000000 1.000000 0.300000 1.000000 163.000000 164.000000 2.000000 3.500000 0.500000 0.500000 0.000000 0.000000 0.300000 4.800000
75% 2.000000 1.000000 0.400000 1.000000 236.000000 236.000000 2.000000 3.500000 1.000000 0.500000 0.950000 0.000000 0.300000 5.350000
max 2.000000 6.000000 47.400000 99.000000 265.000000 265.000000 4.000000 3.500000 1.300000 0.500000 146.000000 58.000000 0.300000 149.300000

Histogram

In [0]:
#Getting to know there are negative in the price given by passenger from sample of dataset only(For exact count I used BigQuery as the data cannot be analysed using pandas dataframe)
%matplotlib inline
import matplotlib.pyplot as plt # visualization
import seaborn as sns
f, ax = plt.subplots(figsize=(8, 6))
df_head.fare_amount.hist()
plt.show()

There are 29860 rows in 2018 First half dataset where amount trip is less than 0 which is not possible as there is no online wallet system that NYC city taxi used in 2018 nor 2019. Hence this must be the possible outliers or there must be erroneous entry in the amount column. I decided to normalise the value and keep it.

Trip distance is greater than zero and this Hexbin plot gives me the information that only certain zones have move trip_distance associated to it

HexBin plot (to analyse the relationship between two variable

In [0]:
#Getting to know there are negative in the price given by passenger from sample of dataset only(For exact count I used BigQuery as the data cannot be analysed using pandas dataframe)
%matplotlib inline
import matplotlib.pyplot as plt # visualization
import seaborn as sns
df_head.plot.hexbin(x="trip_distance",y="PULocationID")
plt.show()

Density Plot

In [0]:
#Getting to know there are negative in the price given by passenger from sample of dataset only(For exact count I used BigQuery as the data cannot be analysed using pandas dataframe)
%matplotlib inline
import matplotlib.pyplot as plt # visualization
import seaborn as sns
df_head[["total_amount"]].plot.kde()
plt.show()
In [0]:
#Referred: Profressor Dr. Alireza Manashty
import numpy as np
df_head['Amount_log']=np.log(df_head.total_amount)
df_head[["Amount_log"]][np.isfinite(df_head.Amount_log)].dropna().plot.kde()
plt.title("Log Amount density")
plt.show()
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:679: RuntimeWarning:

divide by zero encountered in log

/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:679: RuntimeWarning:

invalid value encountered in log

Finding the number of pickups where pickup time is greater than drop off time

In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2018firsthalf` where tpep_pickup_datetime > tpep_dropoff_datetime 
"""
df = client.query(query).to_dataframe()
In [0]:
df.head()
Out[0]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance PULocationID DOLocationID fare_amount
0 2018-01-23 13:12:19+00:00 2018-01-23 00:28:25+00:00 2 20.9 132 244 52.0
1 2018-02-09 09:50:19+00:00 2018-02-09 04:48:28+00:00 1 1.3 90 113 6.0
2 2018-02-09 15:57:17+00:00 2018-02-09 15:14:23+00:00 1 1.9 234 48 14.0
3 2018-02-04 15:15:17+00:00 2018-02-03 16:06:36+00:00 2 20.2 132 151 52.0
4 2018-02-11 15:15:01+00:00 2017-12-24 17:09:53+00:00 2 4.7 246 66 25.0
In [0]:
print("Number of rows where pickup time is greater than drop off:")
print(df.shape[0])
Number of rows where pickup time is greater than drop off:
18

I decided to remove these rows as they are certainty the outliers and will not make much difference

Finding the number of pickups where there are no passengers

In [0]:
query = """
SELECT * FROM `nyc-taxi-265120.NYC.2018firsthalf` limit 10000
"""
df = client.query(query).to_dataframe()
In [0]:
#Getting to know there are negative in the price given by passenger from sample of dataset only(For exact count I used BigQuery as the data cannot be analysed using pandas dataframe)
%matplotlib inline
import matplotlib.pyplot as plt # visualization
import seaborn as sns
f, ax = plt.subplots(figsize=(8, 6))
df.passenger_count.hist()
plt.show()

Now using Big Query finding the extact number of rows

In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2018firsthalf` where passenger_count = 0 
"""
df = client.query(query).to_dataframe()
In [0]:
print("Number of pickups where there are no passengers:")
print(df.shape[0])
Number of pickups where there are no passengers:
402117
In [0]:
#Getting to know there are negative in the price given by passenger from sample of dataset only(For exact count I used BigQuery as the data cannot be analysed using pandas dataframe)
%matplotlib inline
import matplotlib.pyplot as plt # visualization
import seaborn as sns
df_head.plot.hexbin(x="passenger_count",y="PULocationID")
plt.show()

This can be due to the reason the passenger booked and then quicly canceled the trip.

In [0]:
query = """
SELECT  tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,PULocationID,DOLocationID,total_amount FROM `nyc-taxi-265120.NYC.2018firsthalf` LIMIT 10000
"""
df = client.query(query).to_dataframe()

Finding the number of pickups where duration is more than 12 hours

In [0]:
sns.boxplot(y="total_amount", data =df)
plt.show()

Moving onto the Trip duration, according to NYC Taxi & Limousine Commision Regulations the maximum allowed trip duration in a 24 hour interval is 12 hours. So selecting only the data points whose trip times is greater than 1 and less than 720 minutes (12 hours). Given below is the PDF of log-trip-times along with its Normal Q-Q plot which says that the data follows a good normal distribution.

In [0]:
query = """
SELECT  tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2018firsthalf` WHERE TIMESTAMP_DIFF(tpep_dropoff_datetime,tpep_pickup_datetime,HOUR)>12 
"""
df = client.query(query).to_dataframe()
In [0]:
df.head()
Out[0]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance PULocationID DOLocationID fare_amount
0 2018-01-03 20:45:14+00:00 2018-01-04 20:43:04+00:00 7 0.0 4 4 0.3
1 2018-02-05 16:41:27+00:00 2018-02-06 16:40:01+00:00 9 0.0 186 186 9.0
2 2018-04-03 05:17:07+00:00 2018-04-04 00:00:00+00:00 9 0.0 265 265 92.0
3 2018-03-18 14:58:11+00:00 2018-03-19 14:53:29+00:00 0 0.0 265 265 200.0
4 2018-05-11 09:18:52+00:00 2018-05-12 09:17:43+00:00 7 0.0 132 132 70.7
In [0]:
print(df.shape[0])
104528

2018 Second Half Yellow Cab Data Cleaning (Big Query)

  1. Analyse if there is rows with total trip amount =0 ?
  2. Is there rows with pick up time > drop off time.
  3. Analyse rows which have 0 passenger count.

Finding the number of trips where amount given by passenger is less than zero.

In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2018SecondHalf` where total_amount <= 0 
"""
df = client.query(query).to_dataframe()
In [0]:
df.head()
Out[0]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance PULocationID DOLocationID fare_amount
0 2018-08-04 21:30:09 2018-08-04 22:01:18 1 3.02 232 80 0.0
1 2018-08-05 19:37:02 2018-08-05 19:37:02 1 0.00 129 264 0.0
2 2018-08-06 00:06:13 2018-08-06 00:17:28 1 3.65 66 249 0.0
3 2018-08-06 06:38:01 2018-08-06 06:45:13 5 2.97 65 80 0.0
4 2018-08-04 18:24:14 2018-08-04 18:24:42 2 0.10 133 133 -2.5
In [0]:
print("Number of rows with amount of trip <=0:")
print(df.shape[0])
Number of rows with amount of trip <=0:
45192
In [0]:
query = """
SELECT * FROM `nyc-taxi-265120.NYC.2018SecondHalf` LIMIT 10000
"""
df = client.query(query).to_dataframe()
In [0]:
#Scatter plot that gives the indication we are Zone ID where the total_amount received is less than 0.
df.plot(kind='scatter', x='total_amount', y='PULocationID',alpha = 0.5,color = 'blue')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbe13de2320>
In [0]:
# histograms
df.hist(figsize=(15,20));
plt.figure();
<Figure size 432x288 with 0 Axes>
In [0]:
# scatter plot matrix
pd.plotting.scatter_matrix(df,figsize=(20,20))
plt.figure();
<Figure size 432x288 with 0 Axes>

Finding the number of pickups where pickup time is greater than drop off time

In [0]:
query = """
SELECT count(*) FROM `nyc-taxi-265120.NYC.2018SecondHalf` where tpep_pickup_datetime > tpep_dropoff_datetime 
"""
df = client.query(query).to_dataframe()

print("Number of rows where pickup time is greater than drop off:")
print(df["f0_"][0])
Number of rows where pickup time is greater than drop off:
1026

Finding the number of pickups where there are no passengers

In [0]:
query = """
SELECT count(*) FROM `nyc-taxi-265120.NYC.2018SecondHalf` where passenger_count = 0 
"""
df = client.query(query).to_dataframe()

print("Number of pickups where there are no passengers:")
print(df["f0_"][0])
Number of pickups where there are no passengers:
530950

2019 First Half Yellow Cab Data Cleaning (Big Query)

  1. Analyse if there is rows with total trip amount =0 ?
  2. Is there rows with pick up time > drop off time.
  3. Analyse rows which have 0 passenger count.

In [0]:
query = """
SELECT * FROM `nyc-taxi-265120.NYC.2019firsthalf` LIMIT 10000
"""
df = client.query(query).to_dataframe()
In [0]:
# histograms
df.hist(figsize=(15,20));
plt.figure();
<Figure size 432x288 with 0 Axes>
In [0]:
# scatter plot matrix
pd.plotting.scatter_matrix(df,figsize=(20,20))
plt.figure();
<Figure size 432x288 with 0 Axes>
In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2019firsthalf` where total_amount < 0 
"""
df = client.query(query).to_dataframe()
In [0]:
df.head()
Out[0]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance PULocationID DOLocationID fare_amount
0 2019-01-03 17:56:33 2019-01-03 17:56:41 1 0.05 166 166 -2.5
1 2019-01-03 17:09:38 2019-01-03 17:10:03 1 0.18 157 160 -2.5
2 2019-01-04 16:06:11 2019-01-04 16:06:25 1 0.00 152 152 -2.5
3 2019-01-04 16:25:35 2019-01-04 16:25:46 1 0.00 197 197 -2.5
4 2019-01-04 17:21:52 2019-01-04 17:21:57 1 0.00 226 226 -2.5
In [0]:
print("Number of rows with amount of trip <0:")
print(df.shape[0])
Number of rows with amount of trip <0:
67963
In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2019firsthalf` where tpep_pickup_datetime > tpep_dropoff_datetime LIMIT 10000
"""
df = client.query(query).to_dataframe()

print("Number of rows where pickup time is greater than drop off:")
print(df.shape[0])
Number of rows where pickup time is greater than drop off:
18
In [0]:
query = """
SELECT count(*) FROM `nyc-taxi-265120.NYC.2019firsthalf` where passenger_count = 0 LIMIT 10000
"""
df = client.query(query).to_dataframe()
print("Number of pickups where there are no passengers:")
print(df["f0_"][0])
Number of pickups where there are no passengers:
785404

2019 Second Half Yellow Cab Data Cleaning (Big Query)

  1. Analyse if there is rows with total trip amount =0 ?
  2. Is there rows with pick up time > drop off time.
  3. Analyse rows which have 0 passenger count.

In [0]:
#Visulizing sample of Data Points
query = """
SELECT * FROM `nyc-taxi-265120.NYC.2019secondhalf` LIMIT 10000
"""
df = client.query(query).to_dataframe()
In [0]:
# histograms
df.hist(figsize=(15,20));
plt.figure();
<Figure size 432x288 with 0 Axes>
In [0]:
# scatter plot matrix
pd.plotting.scatter_matrix(df,figsize=(20,20))
plt.figure();
<Figure size 432x288 with 0 Axes>

Joint Plot used to study the distribution

In [0]:
sns.jointplot(x='trip_distance',y='PULocationID' ,data=df, kind='reg');
In [0]:
sns.jointplot(x='total_amount',y='PULocationID' ,data=df);
In [0]:
query = """
SELECT tpep_pickup_datetime,	tpep_dropoff_datetime,	passenger_count,	trip_distance,PULocationID,	DOLocationID,fare_amount FROM `nyc-taxi-265120.NYC.2019secondhalf` where total_amount <= 0 
"""
df = client.query(query).to_dataframe()
In [0]:
df.head()
Out[0]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance PULocationID DOLocationID fare_amount
0 2019-07-01 07:04:53 2019-07-01 07:04:53 1.0 0.0 77.0 264.0 0.0
1 2019-07-01 09:10:24 2019-07-01 09:10:24 1.0 0.0 63.0 264.0 0.0
2 2019-07-01 09:56:06 2019-07-01 09:59:07 1.0 0.0 207.0 207.0 0.0
3 2019-07-01 11:33:44 2019-07-01 11:33:44 1.0 0.0 205.0 264.0 0.0
4 2019-07-01 12:47:33 2019-07-01 12:48:34 1.0 0.0 207.0 207.0 0.0
In [0]:
print("Number of rows with amount of trip <=0:")
print(df.shape[0])
Number of rows with amount of trip <=0:
112402
In [0]:
query = """
SELECT * FROM `nyc-taxi-265120.NYC.2019secondhalf` where tpep_pickup_datetime > tpep_dropoff_datetime 
"""
df = client.query(query).to_dataframe()

print("Number of rows where pickup time is greater than drop off:")
print(df["f0_"][0])
In [0]:
query = """
SELECT count(*) FROM `nyc-taxi-265120.NYC.2019secondhalf` where passenger_count = 0 
"""
df = client.query(query).to_dataframe()
print("Number of pickups where there are no passengers:")
print(df["f0_"][0])

INNER JOIN (Eliminating out of Bounds Location if exists)

Eliminaing out of Bound Location with the help of INNER JOIN.

  1. Table A- Taxi Trips Data
  2. Table B- Zone Data

The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns. It gives only the trips pickups in the NYC region.</p>

In [0]:
#Sample showing how INNER JOIN can be done using Big Query.
query = """
with td as(
WITH AB AS
(
SELECT  CAST(tpep_pickup_datetime AS DATETIME) AS PICKUP,* FROM `nyc-taxi-265120.NYC.2019firsthalf` where total_amount>0
)  
SELECT  PICKUP,CAST(EXTRACT (YEAR from PICKUP) AS STRING) AS year,EXTRACT (DAYOFYEAR from PICKUP) AS daynumber,EXTRACT (HOUR from PICKUP) AS hour, cast(PULocationID as STRING) as LOCATION,COUNT(*) AS numtrips FROM AB group by PICKUP,year,daynumber,hour,LOCATION),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT cast(DATE(td.PICKUP) as STRING) AS A1,
points.zone_id,CAST(TIME(DATETIME_TRUNC(td.PICKUP, HOUR)) as string) rounded_to_hour,
SUM(td.numtrips) as label,
FROM  td 

INNER JOIN 

points ON points.zone_id=td.LOCATION AND
eXTRACT (YEAR from PICKUP) =2018 AND td.daynumber<150
group by zone_id,A1,rounded_to_hour
"""
df1 = client.query(query).to_dataframe()

5.2 Data Wrangling

Data wrangling which is also referred as data munging.The main moto transform data into a format that is more appropriate and considered more valuable.

2018 Data Wrangling

Add columns: year, month, weekday name, and hour using BIg Query

In [0]:
query = """
with td as(
WITH AB AS
(
SELECT  CAST(tpep_pickup_datetime AS DATETIME) AS PICKUP,* FROM `nyc-taxi-265120.NYC.2018firsthalf` where total_amount>0 or  tpep_pickup_datetime < tpep_dropoff_datetime or  TIMESTAMP_DIFF(tpep_dropoff_datetime,tpep_pickup_datetime,HOUR)>12 
)  
SELECT  PICKUP,CAST(EXTRACT (YEAR from PICKUP) AS STRING) AS year,EXTRACT (DAYOFYEAR from PICKUP) AS daynumber,EXTRACT (HOUR from PICKUP) AS hour, cast(PULocationID as STRING) as LOCATION,COUNT(*) AS numtrips FROM AB group by PICKUP,year,daynumber,hour,LOCATION),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
cast(DATE(td.PICKUP) as STRING) AS A1,
points.zone_id,
CAST(TIME(DATETIME_TRUNC(td.PICKUP, HOUR)) as string) rounded_to_hour,
SUM(td.numtrips) as label,
FROM  td INNER JOIN points ON points.zone_id=td.LOCATION AND
eXTRACT (YEAR from PICKUP) =2018 
group by zone_id,A1,rounded_to_hour
"""
df1 = client.query(query).to_dataframe()
In [0]:
import pandas as pd
df1["time"]=pd.to_datetime(df1['A1'] + ' ' + df1['rounded_to_hour'])

Binning so that I can get a window of 1 hr for each zone and for each date

A1 column is th pickup date. I took the help of pivot and then unstack the pivot table to get the regular 1 hr window for each zone on each pickup day.

In [0]:
df1.head()
Out[0]:
A1 zone_id rounded_to_hour label time
0 2018-01-11 234 00:00:00 230 2018-01-11
1 2018-01-10 45 00:00:00 28 2018-01-10
2 2018-01-11 263 00:00:00 77 2018-01-11
3 2018-01-11 246 00:00:00 75 2018-01-11
4 2018-01-10 113 00:00:00 52 2018-01-10
In [0]:
#_____Converting dataframe to Pivot Table______

a=df1.pivot_table("label", "time", "zone_id")
In [0]:
a.head()
Out[0]:
zone_id 1 10 100 101 102 106 107 108 109 11 110 111 112 113 114 115 116 117 118 119 12 120 121 122 123 124 125 126 127 128 129 13 130 131 132 133 134 135 136 137 ... 62 63 64 65 66 67 68 69 7 70 71 72 73 74 75 76 77 78 79 8 80 81 82 83 84 85 86 87 88 89 9 90 91 92 93 94 95 96 97 98
time
2018-01-01 00:00:00 NaN NaN 107.0 NaN NaN 3.0 474.0 NaN NaN NaN NaN NaN 15.0 243.0 346.0 NaN 59.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN 56.0 NaN 8.0 NaN 15.0 63.0 2.0 NaN 173.0 2.0 1.0 1.0 NaN 252.0 ... NaN NaN NaN 28.0 10.0 NaN 494.0 1.0 40.0 3.0 NaN NaN NaN 72.0 118.0 NaN NaN 2.0 749.0 NaN 24.0 NaN 5.0 3.0 NaN NaN NaN 103.0 39.0 2.0 NaN 337.0 NaN 1.0 1.0 1.0 1.0 NaN 5.0 NaN
2018-01-01 01:00:00 NaN 1.0 169.0 NaN NaN 16.0 661.0 NaN NaN NaN NaN NaN 65.0 236.0 207.0 NaN 78.0 NaN NaN 5.0 1.0 1.0 NaN NaN 2.0 NaN 111.0 3.0 10.0 NaN 37.0 94.0 NaN NaN 133.0 1.0 1.0 1.0 1.0 300.0 ... 3.0 NaN NaN 35.0 11.0 NaN 508.0 3.0 99.0 4.0 NaN NaN NaN 119.0 153.0 2.0 NaN 1.0 818.0 NaN 72.0 NaN 15.0 11.0 NaN 2.0 NaN 223.0 50.0 8.0 1.0 327.0 1.0 1.0 NaN 1.0 5.0 NaN 40.0 NaN
2018-01-01 02:00:00 NaN NaN 210.0 NaN 1.0 9.0 497.0 NaN NaN NaN NaN NaN 64.0 174.0 217.0 NaN 75.0 NaN NaN 1.0 3.0 NaN NaN NaN NaN NaN 106.0 NaN 12.0 NaN 32.0 42.0 1.0 NaN 33.0 6.0 NaN NaN 3.0 275.0 ... 6.0 NaN 1.0 35.0 12.0 1.0 391.0 2.0 133.0 6.0 NaN 1.0 NaN 112.0 134.0 2.0 NaN 1.0 963.0 NaN 78.0 NaN 31.0 23.0 NaN 1.0 NaN 149.0 52.0 5.0 NaN 254.0 NaN 1.0 NaN 2.0 7.0 NaN 27.0 NaN
2018-01-01 03:00:00 1.0 1.0 187.0 NaN NaN 9.0 390.0 NaN NaN NaN NaN NaN 40.0 134.0 332.0 NaN 56.0 NaN NaN 5.0 4.0 1.0 NaN NaN 2.0 NaN 87.0 3.0 13.0 NaN 69.0 19.0 1.0 NaN 26.0 1.0 NaN NaN 4.0 214.0 ... 1.0 NaN NaN 18.0 1.0 NaN 411.0 5.0 106.0 1.0 1.0 NaN NaN 79.0 65.0 2.0 1.0 2.0 973.0 NaN 72.0 NaN 16.0 15.0 NaN 1.0 NaN 73.0 39.0 1.0 NaN 274.0 NaN 1.0 NaN 1.0 5.0 NaN 20.0 1.0
2018-01-01 04:00:00 3.0 1.0 194.0 1.0 NaN 8.0 283.0 NaN NaN 1.0 NaN NaN 28.0 83.0 228.0 NaN 44.0 NaN 1.0 1.0 NaN NaN 1.0 NaN NaN NaN 64.0 NaN 10.0 NaN 56.0 17.0 NaN NaN 40.0 1.0 NaN NaN 3.0 133.0 ... NaN NaN NaN 15.0 5.0 NaN 293.0 1.0 106.0 4.0 1.0 1.0 NaN 59.0 65.0 1.0 1.0 NaN 566.0 NaN 59.0 NaN 25.0 25.0 NaN NaN NaN 29.0 29.0 2.0 NaN 162.0 NaN 1.0 NaN NaN 1.0 NaN 10.0 1.0

5 rows × 258 columns

In [0]:
#__________Unstacking the Pivot Table___________

b=df1.pivot_table("label", "time", "zone_id").unstack().reset_index()
In [0]:
b.head()
Out[0]:
zone_id time 0
0 1 2018-01-01 00:00:00 NaN
1 1 2018-01-01 01:00:00 NaN
2 1 2018-01-01 02:00:00 NaN
3 1 2018-01-01 03:00:00 1.0
4 1 2018-01-01 04:00:00 3.0

Filling the NaN with zero pickups

In [0]:
b.rename( columns={0:'pickups'}, inplace=True )
b1=b.fillna(0)
In [0]:
b1.head()
Out[0]:
zone_id time pickups
0 1 2018-01-01 00:00:00 0.0
1 1 2018-01-01 01:00:00 0.0
2 1 2018-01-01 02:00:00 0.0
3 1 2018-01-01 03:00:00 1.0
4 1 2018-01-01 04:00:00 3.0

Storing the 2018 first half prepared data on Google cloud.

In [0]:
from datalab.context import Context
#Dont run this

b1.to_gbq('hello.2018DataFirsthalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
/usr/local/lib/python3.6/dist-packages/pandas_gbq/gbq.py:1127: FutureWarning: verbose is deprecated and will be removed in a future version. Set logging level in order to vary verbosity
  stacklevel=1,
3it [00:56, 15.09s/it]

Similarly preparing the data set for 2nd Half of 2018

In [0]:
query = """
with td as(
WITH AB AS
(
SELECT  CAST(tpep_pickup_datetime AS DATETIME) AS PICKUP,* FROM `nyc-taxi-265120.NYC.2018SecondHalf` where total_amount>0 or  tpep_pickup_datetime < tpep_dropoff_datetime or TIMESTAMP_DIFF(tpep_dropoff_datetime,tpep_pickup_datetime,HOUR)>12 
)  
SELECT  PICKUP,CAST(EXTRACT (YEAR from PICKUP) AS STRING) AS year,EXTRACT (DAYOFYEAR from PICKUP) AS daynumber,EXTRACT (HOUR from PICKUP) AS hour, cast(PULocationID as STRING) as LOCATION,COUNT(*) AS numtrips FROM AB group by PICKUP,year,daynumber,hour,LOCATION),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
cast(DATE(td.PICKUP) as STRING) AS A1,
points.zone_id,
CAST(TIME(DATETIME_TRUNC(td.PICKUP, HOUR)) as string) rounded_to_hour,
SUM(td.numtrips) as label,
FROM  td INNER JOIN points ON points.zone_id=td.LOCATION AND
eXTRACT (YEAR from PICKUP) =2018 
group by zone_id,A1,rounded_to_hour
"""
df1 = client.query(query).to_dataframe()
In [0]:
import pandas as pd
df1["time"]=pd.to_datetime(df1['A1'] + ' ' + df1['rounded_to_hour'])
a=df1.pivot_table("label", "time", "zone_id")
b=df1.pivot_table("label", "time", "zone_id").unstack().reset_index()
b.rename( columns={0:'pickups'}, inplace=True )
b1=b.fillna(0)
In [0]:
from datalab.context import Context
#Dont run this

b1.to_gbq('hello.2018DataSecondhalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
/usr/local/lib/python3.6/dist-packages/pandas_gbq/gbq.py:1127: FutureWarning: verbose is deprecated and will be removed in a future version. Set logging level in order to vary verbosity
  stacklevel=1,
3it [00:43, 11.34s/it]

2019 Data Wrangling

Same Data preparation steps applied as for 2018 Data set

After finding the outliers as above I decided to remove only those rows in which all the three conditions were not satified that is:

  • Total amount<=0
  • Pickup DateTime is greater than Drop off time
  • The duration of ride is more than 12 hours
If all three are incorrect that means the trip is not possible at all. By using BIg query I dicided to take the intersection of the three consition and consider only the valid pickups.

In [0]:
query = """
with td as(
WITH AB AS
(
SELECT  CAST(tpep_pickup_datetime AS DATETIME) AS pickuptime,CAST(tpep_dropoff_datetime AS DATETIME) AS dropofftime,  CAST(tpep_pickup_datetime AS DATETIME) AS PICKUP,* FROM `nyc-taxi-265120.NYC.2019firsthalf` where total_amount>0 or  tpep_pickup_datetime < tpep_dropoff_datetime or  TIMESTAMP_DIFF(pickuptime,dropofftime,HOUR)>12 
)  
SELECT  PICKUP,CAST(EXTRACT (YEAR from PICKUP) AS STRING) AS year,EXTRACT (DAYOFYEAR from PICKUP) AS daynumber,EXTRACT (HOUR from PICKUP) AS hour, cast(PULocationID as STRING) as LOCATION,COUNT(*) AS numtrips FROM AB group by PICKUP,year,daynumber,hour,LOCATION),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
cast(DATE(td.PICKUP) as STRING) AS A1,
points.zone_id,
CAST(TIME(DATETIME_TRUNC(td.PICKUP, HOUR)) as string) rounded_to_hour,
SUM(td.numtrips) as label,
FROM  td INNER JOIN points ON points.zone_id=td.LOCATION AND
eXTRACT (YEAR from PICKUP) =2019 
group by zone_id,A1,rounded_to_hour
"""
df1 = client.query(query).to_dataframe()
In [0]:
import pandas as pd
df1["time"]=pd.to_datetime(df1['A1'] + ' ' + df1['rounded_to_hour'])
a=df1.pivot_table("label", "time", "zone_id")
b=df1.pivot_table("label", "time", "zone_id").unstack().reset_index()
b.rename( columns={0:'pickups'}, inplace=True )
b1=b.fillna(0)
In [0]:
from datalab.context import Context
#Dont run this

b1.to_gbq('hello.2019DataFirsthalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
/usr/local/lib/python3.6/dist-packages/pandas_gbq/gbq.py:1127: FutureWarning: verbose is deprecated and will be removed in a future version. Set logging level in order to vary verbosity
  stacklevel=1,
3it [00:28,  8.09s/it]
In [0]:
query = """
with td as(
WITH AB AS
(
SELECT  CAST(tpep_pickup_datetime AS DATETIME) AS PICKUP,* FROM `nyc-taxi-265120.NYC.2019secondhalf` where total_amount>0 or  tpep_pickup_datetime < tpep_dropoff_datetime or  TIMESTAMP_DIFF(tpep_dropoff_datetime,tpep_pickup_datetime,HOUR)>12 
)  
SELECT  PICKUP,CAST(EXTRACT (YEAR from PICKUP) AS STRING) AS year,EXTRACT (DAYOFYEAR from PICKUP) AS daynumber,EXTRACT (HOUR from PICKUP) AS hour, cast(PULocationID as STRING) as LOCATION,COUNT(*) AS numtrips FROM AB group by PICKUP,year,daynumber,hour,LOCATION),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
cast(DATE(td.PICKUP) as STRING) AS A1,
points.zone_id,
CAST(TIME(DATETIME_TRUNC(td.PICKUP, HOUR)) as string) rounded_to_hour,
SUM(td.numtrips) as label,
FROM  td INNER JOIN points ON points.zone_id=td.LOCATION AND
eXTRACT (YEAR from PICKUP) =2019 
group by zone_id,A1,rounded_to_hour
"""
df1 = client.query(query).to_dataframe()
In [0]:
df1.head()
Out[0]:
A1 zone_id rounded_to_hour label
0 2019-07-16 233 00:00:00 34
1 2019-10-31 87 00:00:00 64
2 2019-07-16 141 00:00:00 48
3 2019-07-17 90 00:00:00 94
4 2019-07-17 232 00:00:00 14
In [0]:
import pandas as pd
df1["time"]=pd.to_datetime(df1['A1'] + ' ' + df1['rounded_to_hour'])
a=df1.pivot_table("label", "time", "zone_id")
b=df1.pivot_table("label", "time", "zone_id").unstack().reset_index()
b.rename( columns={0:'pickups'}, inplace=True )
b1=b.fillna(0)
In [0]:
from datalab.context import Context
#Dont run this

b1.to_gbq('hello.2019DataSecondhalf', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
/usr/local/lib/python3.6/dist-packages/pandas_gbq/gbq.py:1127: FutureWarning: verbose is deprecated and will be removed in a future version. Set logging level in order to vary verbosity
  stacklevel=1,
3it [00:37, 10.32s/it]

Taxi Zone Data Visulizations

In [0]:
query = """
WITH ZONE AS (
SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`
)
SELECT * ,ST_X(p) as longitude,
  ST_Y(p) as latitude FROM ZONE

"""
df = client.query(query).to_dataframe()
In [0]:
import dask.dataframe as dd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import folium
from datetime import datetime
import time
import seaborn as sns
import os
import math
import warnings

Figuring that all Latitute and Longitute for Taxi zone are in NYC city

In [0]:
#detecting the pickups latitude and longitudes which are outside NYC.
outside_NYC = df[(df.latitude != 0) | (df.longitude != 0) ]
m = folium.Map(location = [40.5774, -73.7004], tiles = "Stamen Toner")
outside_pickups = outside_NYC.head(25000)
for i,j in outside_pickups.iterrows():
    if j["latitude"] != 0:
        folium.Marker(location=[j["latitude"], j["longitude"]],popup=j["zone_name"]).add_to(m)
m
Out[0]:
In [0]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/content/drive/My Drive/Data Science/nyc-taxi-265120-baf9a3e4cf9b.json"
%reload_ext google.cloud.bigquery
In [0]:
project_id = 'nyc-taxi-265120' 
from google.cloud import bigquery
client = bigquery.Client(project = project_id)

Combining all the Prepared Data (2019 and 2018 First and second half Both)

In [0]:
##_________Getting First Half 2019 Data _________________##

query = """
SELECT  * FROM `hello.2019DataFirsthalf`
"""
first_2019 = client.query(query).to_dataframe()
In [0]:
##_________Getting Second Half 2019 Data _________________##

query = """
SELECT  * FROM `hello.2019DataSecondhalf`
"""
second_2019 = client.query(query).to_dataframe()
In [0]:
##_________Getting Second Half 2018 Data _________________##

query = """
SELECT  * FROM `hello.2018DataSecondhalf`
"""
second_2018 = client.query(query).to_dataframe()
In [0]:
##_________Getting First Half 2018 Data _________________##

query = """
SELECT  * FROM `hello.2018DataFirsthalf`
"""
first_2018 = client.query(query).to_dataframe()
In [0]:

In [0]:
##_________Concating all Data Together _________________##

import pandas as pd
concat=pd.concat([first_2018,second_2018,second_2019,first_2019],ignore_index=True)
In [0]:
!pip install datalab
In [0]:
## _______________STORING PREPARED DATA TO GOOGLE CLOUD_____________________##

from datalab.context import Context
concat.to_gbq('hello.CombinedData', 
                 Context.default().project_id,
                 chunksize=500000, 
                 if_exists='append',
                 verbose=False
                 )
/usr/local/lib/python3.6/dist-packages/pandas_gbq/gbq.py:1127: FutureWarning: verbose is deprecated and will be removed in a future version. Set logging level in order to vary verbosity
  stacklevel=1,
10it [02:25, 14.71s/it]

Finally Data is prepared in a format that can be modelled and explored easily. Also the outliers from Data has been removed by using Big Query.The following has been done in Data Preparation phse:

  • Established the analytic sandbox
  • Extract, Transform, Load, and Transform (ETLT)
  • Thorough Data exploration done
  • Done Data conditioning (merging)
  • Handled outliers/Missing data using the strategies given in 100 pages of Machine Learning Book
  • Summarized and visualized the data

temp-158716192670373082

Open In Colab

Model Planning

Prepared By: Simranjeet Randhawa

1. Model Planning

The key acticities in this phase is model selectiona and variable selection.Data Exploration is the phase where one tries to understand the data in hand and how the different variables interact between each other. In Machine Learning, Data Exploration always precede the creation of the predictive model as it allows us to come up with ideas in order to increase the models’ performances.

In [0]:
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
In [0]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/content/drive/My Drive/Data Science/nyc-taxi-265120-baf9a3e4cf9b.json"
%reload_ext google.cloud.bigquery
In [0]:
project_id = 'nyc-taxi-265120' 
from google.cloud import bigquery
client = bigquery.Client(project = project_id)

2. Model Planning - Feature Engineering

In model planning,the methods, techniques, and workflow is being determined. In this phase the data is explored to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

In [0]:
query = """
SELECT  * FROM `hello.CombinedData`
"""
data = client.query(query).to_dataframe()

One Hot Encoding

The zone_id is a categorical variable. One hot encoding can be used to align zone_id column wise. It will be helpful to train our model

In [0]:
data.head()
Out[0]:
zone_id time pickups
0 107 2019-07-01 03:00:00+00:00 13.0
1 107 2019-07-09 03:00:00+00:00 13.0
2 107 2019-08-07 03:00:00+00:00 13.0
3 107 2019-09-03 02:00:00+00:00 13.0
4 107 2019-09-03 04:00:00+00:00 13.0
In [0]:
#______ONE HOT ENCODING____

import pandas as pd
one_hot = pd.get_dummies(data['zone_id'])
In [0]:
one_hot.head()
Out[0]:
1 10 100 101 102 106 107 108 109 11 110 111 112 113 114 115 116 117 118 119 12 120 121 122 123 124 125 126 127 128 129 13 130 131 132 133 134 135 136 137 ... 63 64 65 66 67 68 69 7 70 71 72 73 74 75 76 77 78 79 8 80 81 82 83 84 85 86 87 88 89 9 90 91 92 93 94 95 96 97 98 99
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 259 columns

In [0]:
data = data.drop('zone_id',axis = 1)
# Join the encoded df
data = data.join(one_hot)
In [0]:
data.head()
Out[0]:
time pickups 1 10 100 101 102 106 107 108 109 11 110 111 112 113 114 115 116 117 118 119 12 120 121 122 123 124 125 126 127 128 129 13 130 131 132 133 134 135 ... 63 64 65 66 67 68 69 7 70 71 72 73 74 75 76 77 78 79 8 80 81 82 83 84 85 86 87 88 89 9 90 91 92 93 94 95 96 97 98 99
0 2019-07-01 03:00:00+00:00 13.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2019-07-09 03:00:00+00:00 13.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2019-08-07 03:00:00+00:00 13.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 2019-09-03 02:00:00+00:00 13.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 2019-09-03 04:00:00+00:00 13.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 261 columns

DataTime Column Feature Extraction
This dataset is prepared for models other than ARIMA. ARIMA was tested withour DateTime feture extraction as ARIMA can handle DataTime feature.

2019 Data

Adding hour,day of the week, daynumber, min and max temp in the Data and scaling it so that it can be trained

In [0]:
query = """
with wd as (
    SELECT 
        cast(year as STRING) as year,
        EXTRACT (DAYOFYEAR FROM CAST(CONCAT(year,'-',mo,'-',da) AS TIMESTAMP)) AS daynumber, 
        MIN(EXTRACT (DAYOFWEEK FROM CAST(CONCAT(year,'-',mo,'-',da) AS TIMESTAMP))) dayofweek,
        MIN(min) mintemp, MAX(max) maxtemp, MAX(IF(prcp=99.99,0,prcp)) rain
    FROM `bigquery-public-data.noaa_gsod.gsod2019`
    WHERE stn='725030'   --station id 725030=LaGuardia
    GROUP BY 1,2 
  
  -- TAXI DATA
  ),
TD AS(

SELECT  zone_id, time, pickups ,CAST(EXTRACT (YEAR from time) AS STRING) AS year,EXTRACT (DAYOFYEAR from time) AS daynumber,EXTRACT (HOUR from time) AS hour from `hello.CombinedData` ),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
td.year,
points.zone_id,
td.time,
td.daynumber,
td.hour,
cast(wd.dayofweek as STRING) as dayofweek, 
    wd.mintemp, 
    wd.maxtemp,
    wd.rain,
 td.pickups,
    ST_ASTEXT(points.p) AS P,
  FROM wd, td INNER JOIN points ON points.zone_id=td.zone_id
  where wd.year = td.year AND
  wd.daynumber = td.daynumber
  group by year,zone_id,time,daynumber,hour,pickups,P,dayofweek, mintemp, maxtemp, rain
"""
df2 = client.query(query).to_dataframe()
In [0]:
zone=df2[['zone_id', 'pickups']]
In [0]:
zonewise_2019 = zone.groupby(['zone_id']).sum()
In [0]:
zonewise_2019.reset_index(inplace = True)
In [0]:
import plotly.express as px
configure_plotly_browser_state()

fig = px.bar( x=zonewise_2019['zone_id'], y=zonewise_2019['pickups'])
fig.update_layout(
    title="Which Zone id has highest number of pickups in 2019?",
    xaxis_title="Zone ID",
    yaxis_title="Pickups",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f"
    )
)
pyo.iplot(fig)

In [0]:
daynumber=df2[['daynumber', 'pickups']]
In [0]:
daynumber_2019 = daynumber.groupby(['daynumber']).sum()
daynumber_2019.reset_index(inplace = True)
In [0]:
import plotly.express as px
configure_plotly_browser_state()

fig = px.bar( x=daynumber_2019['daynumber'], y=daynumber_2019['pickups'])
fig.update_layout(
    title="How pickups changes Day by Day in 2019?",
    xaxis_title="Day Number",
    yaxis_title="Pickups",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f"
    )
)
pyo.iplot(fig)

In [0]:
ans_2019=df2[['time','pickups']]

hourwise=ans_2019.groupby([ans_2019["time"].dt.year, ans_2019["time"].dt.hour]).mean()
In [0]:
hourwise.index.names = ["year", "hour"]
hourwise.reset_index(inplace = True)
In [0]:
import plotly.express as px
configure_plotly_browser_state()

fig1 = px.bar( x=hourwise['hour'], y=hourwise['pickups'])
fig1.update_layout(
    title="Average pickups Hour by Hour in 2019? (Considering all Zones)",
    xaxis_title="Hour",
    yaxis_title="Average Pickups",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f"
    )
)

pyo.iplot(fig1)

2018 Data

In [0]:
query = """
with wd as (
    SELECT 
        cast(year as STRING) as year,
        EXTRACT (DAYOFYEAR FROM CAST(CONCAT(year,'-',mo,'-',da) AS TIMESTAMP)) AS daynumber, 
        MIN(EXTRACT (DAYOFWEEK FROM CAST(CONCAT(year,'-',mo,'-',da) AS TIMESTAMP))) dayofweek,
        MIN(min) mintemp, MAX(max) maxtemp, MAX(IF(prcp=99.99,0,prcp)) rain
    FROM `bigquery-public-data.noaa_gsod.gsod2018`
    WHERE stn='725030'   --station id 725030=LaGuardia
    GROUP BY 1,2 
  
  -- TAXI DATA
  ),
TD AS(

SELECT  zone_id, time, pickups ,CAST(EXTRACT (YEAR from time) AS STRING) AS year,EXTRACT (DAYOFYEAR from time) AS daynumber,EXTRACT (HOUR from time) AS hour from `hello.CombinedData` ),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
td.year,
points.zone_id,
td.time,
td.daynumber,
td.hour,
cast(wd.dayofweek as STRING) as dayofweek, 
    wd.mintemp, 
    wd.maxtemp,
    wd.rain,
 td.pickups,
    ST_ASTEXT(points.p) AS P,
  FROM wd, td INNER JOIN points ON points.zone_id=td.zone_id
  where wd.year = td.year AND
  wd.daynumber = td.daynumber
  group by year,zone_id,time,daynumber,hour,pickups,P,dayofweek, mintemp, maxtemp, rain
"""
df_2018 = client.query(query).to_dataframe()
In [0]:
daynumber2018=df_2018[['daynumber', 'pickups']]
In [0]:
daynumber_2018 = daynumber2018.groupby(['daynumber']).sum()
In [0]:
daynumber_2018.reset_index(inplace = True)
In [0]:
import plotly.express as px
configure_plotly_browser_state()

fig1 = px.bar( x=daynumber_2018['daynumber'], y=daynumber_2018['pickups'])
fig1.update_layout(
    title="How pickups changes Day by Day in 2018?",
    xaxis_title="Day Number",
    yaxis_title="Pickups",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f"
    )
)

pyo.iplot(fig1)

import plotly.express as px
configure_plotly_browser_state()

fig = px.bar( x=daynumber_2019['daynumber'], y=daynumber_2019['pickups'])
fig.update_layout(
    title="How pickups changes Day by Day in 2019?",
    xaxis_title="Day Number",
    yaxis_title="Pickups",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f"
    )
)
pyo.iplot(fig)

In [0]:
df_2018.head()
Out[0]:
year zone_id time daynumber hour dayofweek mintemp maxtemp rain pickups P
0 2018 235 2018-06-01 19:00:00+00:00 152 19 6 62.1 82.0 0.08 1.0 POINT(-73.9159758118083 40.8525209629819)
1 2018 235 2018-06-02 16:00:00+00:00 153 16 7 63.0 87.1 0.02 1.0 POINT(-73.9159758118083 40.8525209629819)
2 2018 235 2018-06-07 15:00:00+00:00 158 15 5 60.1 71.1 0.00 1.0 POINT(-73.9159758118083 40.8525209629819)
3 2018 235 2018-06-09 04:00:00+00:00 160 4 7 62.1 82.9 0.00 1.0 POINT(-73.9159758118083 40.8525209629819)
4 2018 235 2018-06-10 13:00:00+00:00 161 13 1 66.0 81.0 0.00 1.0 POINT(-73.9159758118083 40.8525209629819)

I want to determine the relationship between the outcome and the input variables. This makes it a regression problem. The input variables are zone_id, daynumber, hour, mintemp and maxtemp,rain,year. The output variable is the pickups. But I can use forecasting models like ARIMA to predict the number of pickups if the dataset can be wrangled to a time series data

In [0]:
ans=df_2018[['time','pickups']]
hourwise=ans.groupby([ans["time"].dt.year, ans["time"].dt.hour]).mean()
In [0]:
hourwise.index.names = ["year", "hour"]
hourwise.reset_index(inplace = True)
In [0]:
import plotly.express as px
configure_plotly_browser_state()

fig1 = px.bar( x=hourwise['hour'], y=hourwise['pickups'])
fig1.update_layout(
    title="Average pickups Hour by Hour in 2018? ( COnsidering all Zones)",
    xaxis_title="Hour",
    yaxis_title="Average Pickups",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f"
    )
)

pyo.iplot(fig1)

By exploration, it can be said that the some features to be considered for training a model are as follows:

  1. Zone id: Zone id is a categorical variable and can be thus has been handled using the techinique of One Hot Encoding.
  2. Hour of the day: As we say above that the rides drop between 1AM - 5AM. When it is 6AM the number of pickups increases and we expect the maximum pickups between 6 PM to 9PM. Hence it is also an important factor
  3. Day of the week: It is again a feature that can be considered to train the model. Like on weekends the pickups will be different as compared to the week days.
  4. Day of the year: It is again a feature that can be considered to train the model.The pickups varies as the days progresses.
  5. Min Temp: It can be considered in order to build the model.The pickups varies min temp changes.
  6. Max Temp: It is the maximum temperature that can be reached in a day It is also considered one of the feature while building the model.

In [0]:
import matplotlib.pyplot as plt # visualization
import seaborn as sns
data = pd.concat([hourwise['hour'], hourwise['pickups']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='hour', y='pickups', data=data, showfliers=False)
In [0]:
f, ax = plt.subplots(figsize=(8, 6))
sns.distplot(hourwise['pickups'])

Correlation Matrix: Correlation is a bi-variate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of relationship, the value of the correlation coefficient varies between +1 and -1. A value of ± 1 indicates a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker. The direction of the relationship is indicated by the sign of the coefficient; a + sign indicates a positive relationship and a — sign indicates a negative relationship. Usually, in statistics, we measure four types of correlations: Pearson correlation, Kendall rank correlation, and Spearman correlation. The graph below will give you an idea about correlation.

In [0]:
import seaborn as sns

import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(8, 6))
corrMatrix = df_2018.corr()
sns.heatmap(corrMatrix, annot=True)

Feature Engineering for using ARIMA

In [0]:
query = """
SELECT  * from `hello.CombinedData`
"""
df2 = client.query(query).to_dataframe()
In [0]:
df2.head()
Out[0]:
zone_id time pickups
0 7 2018-01-01 13:00:00+00:00 14.0
1 7 2018-01-04 12:00:00+00:00 14.0
2 7 2018-01-04 13:00:00+00:00 14.0
3 7 2018-01-05 21:00:00+00:00 14.0
4 7 2018-01-06 13:00:00+00:00 14.0
In [0]:
data=df2
In [0]:
data["zone_id"] = data["zone_id"].astype(int)
In [0]:
data.drop('zone_id', axis=1, inplace=True)
In [0]:
import pandas as pd
data.time = pd.to_datetime(data.time,format='%Y-%m-%d')
data.index = data.time
data = data.drop('time', axis=1)
data = data.resample('D').sum() # Resmapling the time series data with month starting first.
# Train-Test splitting of time series data

Time Binning

Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range. Using resample function we can divide the time into bins. I have taken a bin of one day. To get the pickups in a day. Later if the model is accurate then I will divide the bins to smaller bins of hourly window.

In [0]:
data.head()
Out[0]:
pickups
time
2018-01-01 00:00:00+00:00 231546.0
2018-01-02 00:00:00+00:00 233052.0
2018-01-03 00:00:00+00:00 261087.0
2018-01-04 00:00:00+00:00 119818.0
2018-01-05 00:00:00+00:00 259731.0
In [0]:
import matplotlib.pyplot as plt
explore1 = data[:int(0.7*(len(data)))]
explore2 = data[int(0.7*(len(data))):]
# ARIMA takes univariate data.
explore1 = explore1['pickups']
explore2 = explore2['pickups']
# Plot of Weekly_Sales with respect to years in train and test.
explore1.plot(figsize=(20,8), title= 'Daily Pickups', fontsize=14)
explore2.plot(figsize=(20,8), title= 'Daily Pickups', fontsize=14)
plt.show()
/usr/local/lib/python3.6/dist-packages/pandas/core/arrays/datetimes.py:1269: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.
  UserWarning,
/usr/local/lib/python3.6/dist-packages/pandas/core/arrays/datetimes.py:1269: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.
  UserWarning,
In [0]:
data = data.fillna(method='ffill')

ADFULLER Test for Stationary

In [0]:
from statsmodels.tsa.stattools import adfuller
result = adfuller(data['pickups'])
print('ADF Statistic: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('Critical Values:')
for key, value in result[4].items():
    print('\t{}: {}'.format(key, value))
ADF Statistic: -1.8620831215235754
p-value: 0.35010313377312674
Critical Values:
	1%: -3.439593802573824
	5%: -2.865619356068967
	10%: -2.568942332870462

The p-value is not in 0.05. So the seasonality from the Data need to be removed. This will be done in Model BUilding Phase.

Summary

Now I have completed feture engineering both for ARIMA and also for Xgboost, RandomForest and other regression techniques. Now I will apply cross validation and use MAPE as the benchmark for deciding the model which I will be using in my Flask application.

Variable selection: For regression method I seleected the following: Zone id: Zone id is a categorical variable and can be thus has been handled using the techinique of One Hot Encoding. Hour of the day: As we say above that the rides drop between 1AM - 5AM. When it is 6AM the number of pickups increases and we expect the maximum pickups between 6 PM to 9PM. Hence it is also an important factor Day of the week: It is again a feature that can be considered to train the model. Like on weekends the pickups will be different as compared to the week days. Day of the year: It is again a feature that can be considered to train the model.The pickups varies as the days progresses. Min Temp: It can be considered in order to build the model.The pickups varies min temp changes. Max Temp: It is the maximum temperature that can be reached in a day It is also considered one of the feature while building the model.

For time series analysis I selected the following: Time and pickups after performing Binning.

Model Selection: For regression method I seleected the following:

  1. Linear Regression
  2. Xgboost
  3. Random Forest Regressor
  4. Light GBM

For time series analysis I selected the following:

  1. ARIMA
  2. Seasonal ARIMA
  3. Prophets

MAPE , MASE AND MAE are three performance measures that are considered in deciding the final model.

temp-158716190787587153

Open In Colab

Model Building

Prepared By: Simranjeet Randhawa

In [0]:
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
In [0]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/content/drive/My Drive/Data Science/nyc-taxi-265120-baf9a3e4cf9b.json"
%reload_ext google.cloud.bigquery
In [0]:
project_id = 'nyc-taxi-265120' 
from google.cloud import bigquery
client = bigquery.Client(project = project_id)

1. Model Building

In this phase, I will develop datasets for training and testing purposes. Moreover I will consider whether your existing tools will suffice for running the models or it will need a more robust environment (like fast and parallel processing). I will analyse different models on the data and thereby choose the one with minimum mean square error and higher accuracy.

In [0]:
query = """
SELECT  * from `hello.CombinedData`
"""
df2 = client.query(query).to_dataframe()
data=df2
In [0]:
import pandas as pd
data["zone_id"] = data["zone_id"].astype(int)
data.drop('zone_id', axis=1, inplace=True)
data.time = pd.to_datetime(data.time,format='%Y-%m-%d')
data.index = data.time
data = data.drop('time', axis=1)
data = data.resample('D').sum()
In [0]:
# Train-Test splitting of time series data
import matplotlib.pyplot as plt
train_data = data[:int(0.7*(len(data)))]
test_data = data[int(0.7*(len(data))):]
train_data = train_data['pickups']
test_data = test_data['pickups']
train_data.plot(figsize=(20,8), title= 'Day Pickups ', fontsize=14)
test_data.plot(figsize=(20,8), title= 'Day Pickups', fontsize=14)
plt.show()
In [0]:
from statsmodels.tsa.stattools import adfuller
result = adfuller(data['pickups'])
print('p-value is calculated as: {}'.format(result[1]))
p-value is calculated as: 0.35010313377312674
In [0]:
!pip install pyramid-arima

First Trying without removing seasonality then I will remove seasonality out of data and forecast the number of pickups

In [0]:
# Applying auto_arima model on train data.
from pyramid.arima import auto_arima
model_auto_arima = auto_arima(train_data, trace=True, error_action='ignore', suppress_warnings=True)
model_auto_arima = auto_arima(train_data, trace=True,start_p=0, start_q=0, start_P=0, start_Q=0, max_p=10, max_q=10, max_P=10, max_Q=10, seasonal=True,stepwise=False, suppress_warnings=True, D=1, max_D=10,error_action='ignore',approximation = False)
model_auto_arima.fit(train_data)
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11710.874, BIC=11736.268, Fit time=1.053 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11839.145, BIC=11847.610, Fit time=0.028 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11841.146, BIC=11853.843, Fit time=0.051 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11844.011, BIC=11856.708, Fit time=0.063 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11802.389, BIC=11823.552, Fit time=0.180 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11675.143, BIC=11704.770, Fit time=1.512 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11733.010, BIC=11758.405, Fit time=0.296 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11775.916, BIC=11809.776, Fit time=0.971 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11752.180, BIC=11773.342, Fit time=0.229 seconds
Fit ARIMA: order=(4, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11694.880, BIC=11732.972, Fit time=1.819 seconds
Fit ARIMA: order=(4, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11640.034, BIC=11673.894, Fit time=1.772 seconds
Fit ARIMA: order=(4, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11729.631, BIC=11759.259, Fit time=0.428 seconds
Fit ARIMA: order=(5, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11627.128, BIC=11669.452, Fit time=2.410 seconds
Fit ARIMA: order=(5, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11619.703, BIC=11657.795, Fit time=2.409 seconds
Fit ARIMA: order=(5, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11690.992, BIC=11724.852, Fit time=0.754 seconds
Total fit time: 13.987 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11839.145, BIC=11847.610, Fit time=0.038 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11844.011, BIC=11856.708, Fit time=0.090 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11798.898, BIC=11815.828, Fit time=0.150 seconds
Fit ARIMA: order=(0, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11745.936, BIC=11767.098, Fit time=0.229 seconds
Fit ARIMA: order=(0, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11747.480, BIC=11772.875, Fit time=0.342 seconds
Fit ARIMA: order=(0, 1, 5) seasonal_order=(0, 0, 0, 1); AIC=11748.919, BIC=11778.546, Fit time=0.571 seconds
Fit ARIMA: order=(0, 1, 6) seasonal_order=(0, 0, 0, 1); AIC=11748.062, BIC=11781.921, Fit time=0.620 seconds
Fit ARIMA: order=(0, 1, 7) seasonal_order=(0, 0, 0, 1); AIC=11705.552, BIC=11743.644, Fit time=0.780 seconds
Fit ARIMA: order=(0, 1, 8) seasonal_order=(0, 0, 0, 1); AIC=11707.568, BIC=11749.893, Fit time=0.894 seconds
Fit ARIMA: order=(0, 1, 9) seasonal_order=(0, 0, 0, 1); AIC=11684.238, BIC=11730.795, Fit time=1.238 seconds
Fit ARIMA: order=(0, 1, 10) seasonal_order=(0, 0, 0, 1); AIC=11682.639, BIC=11733.429, Fit time=1.419 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11841.146, BIC=11853.843, Fit time=0.050 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11849.458, BIC=11866.388, Fit time=0.061 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11802.389, BIC=11823.552, Fit time=0.281 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11746.573, BIC=11771.967, Fit time=0.557 seconds
Fit ARIMA: order=(1, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11749.813, BIC=11779.440, Fit time=0.572 seconds
Fit ARIMA: order=(1, 1, 5) seasonal_order=(0, 0, 0, 1); AIC=11751.278, BIC=11785.138, Fit time=0.871 seconds
Fit ARIMA: order=(1, 1, 6) seasonal_order=(0, 0, 0, 1); AIC=11736.576, BIC=11774.668, Fit time=0.915 seconds
Fit ARIMA: order=(1, 1, 7) seasonal_order=(0, 0, 0, 1); AIC=11705.675, BIC=11748.000, Fit time=1.015 seconds
Fit ARIMA: order=(1, 1, 8) seasonal_order=(0, 0, 0, 1); AIC=11699.121, BIC=11745.678, Fit time=1.214 seconds
Fit ARIMA: order=(1, 1, 9) seasonal_order=(0, 0, 0, 1); AIC=11680.964, BIC=11731.754, Fit time=1.732 seconds
Fit ARIMA: order=(2, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11828.904, BIC=11845.834, Fit time=0.109 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11752.180, BIC=11773.342, Fit time=0.195 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11710.874, BIC=11736.268, Fit time=1.116 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11787.222, BIC=11816.849, Fit time=0.487 seconds
Fit ARIMA: order=(2, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11627.950, BIC=11661.810, Fit time=1.603 seconds
Fit ARIMA: order=(2, 1, 5) seasonal_order=(0, 0, 0, 1); AIC=11629.961, BIC=11668.053, Fit time=2.519 seconds
Fit ARIMA: order=(2, 1, 6) seasonal_order=(0, 0, 0, 1); AIC=11623.114, BIC=11665.438, Fit time=2.591 seconds
Fit ARIMA: order=(2, 1, 7) seasonal_order=(0, 0, 0, 1); AIC=11611.921, BIC=11658.478, Fit time=3.263 seconds
Fit ARIMA: order=(2, 1, 8) seasonal_order=(0, 0, 0, 1); AIC=11612.252, BIC=11663.042, Fit time=3.912 seconds
Fit ARIMA: order=(3, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11788.300, BIC=11809.462, Fit time=0.155 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11733.010, BIC=11758.405, Fit time=0.268 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11675.143, BIC=11704.770, Fit time=0.972 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11775.916, BIC=11809.776, Fit time=0.929 seconds
Fit ARIMA: order=(3, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11630.664, BIC=11668.756, Fit time=2.225 seconds
Fit ARIMA: order=(3, 1, 5) seasonal_order=(0, 0, 0, 1); AIC=11628.315, BIC=11670.640, Fit time=2.695 seconds
Fit ARIMA: order=(3, 1, 6) seasonal_order=(0, 0, 0, 1); AIC=11629.709, BIC=11676.266, Fit time=2.748 seconds
Fit ARIMA: order=(3, 1, 7) seasonal_order=(0, 0, 0, 1); AIC=11626.149, BIC=11676.938, Fit time=3.627 seconds
Fit ARIMA: order=(4, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11772.843, BIC=11798.237, Fit time=0.245 seconds
Fit ARIMA: order=(4, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11729.631, BIC=11759.259, Fit time=0.429 seconds
Fit ARIMA: order=(4, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11640.034, BIC=11673.894, Fit time=1.758 seconds
Fit ARIMA: order=(4, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11694.880, BIC=11732.972, Fit time=1.836 seconds
Fit ARIMA: order=(4, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11635.588, BIC=11677.912, Fit time=2.538 seconds
Fit ARIMA: order=(4, 1, 5) seasonal_order=(0, 0, 0, 1); AIC=11563.109, BIC=11609.666, Fit time=2.955 seconds
Fit ARIMA: order=(4, 1, 6) seasonal_order=(0, 0, 0, 1); AIC=11569.730, BIC=11620.519, Fit time=3.073 seconds
Fit ARIMA: order=(5, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11697.204, BIC=11726.831, Fit time=0.336 seconds
Fit ARIMA: order=(5, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11690.992, BIC=11724.852, Fit time=0.542 seconds
Fit ARIMA: order=(5, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11619.703, BIC=11657.795, Fit time=2.185 seconds
Fit ARIMA: order=(5, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11627.128, BIC=11669.452, Fit time=2.414 seconds
Fit ARIMA: order=(5, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11565.846, BIC=11612.403, Fit time=2.313 seconds
Fit ARIMA: order=(5, 1, 5) seasonal_order=(0, 0, 0, 1); AIC=11566.455, BIC=11617.244, Fit time=3.034 seconds
Fit ARIMA: order=(6, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11680.189, BIC=11714.048, Fit time=0.535 seconds
Fit ARIMA: order=(6, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11664.255, BIC=11702.347, Fit time=1.033 seconds
Fit ARIMA: order=(6, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11620.210, BIC=11662.535, Fit time=1.914 seconds
Fit ARIMA: order=(6, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11634.457, BIC=11681.014, Fit time=2.464 seconds
Fit ARIMA: order=(6, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=11572.927, BIC=11623.717, Fit time=3.263 seconds
Fit ARIMA: order=(7, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11636.413, BIC=11674.505, Fit time=0.561 seconds
Fit ARIMA: order=(7, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11630.056, BIC=11672.381, Fit time=0.816 seconds
Fit ARIMA: order=(7, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11624.396, BIC=11670.953, Fit time=1.230 seconds
Fit ARIMA: order=(7, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=11624.670, BIC=11675.460, Fit time=3.148 seconds
Fit ARIMA: order=(8, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11625.740, BIC=11668.065, Fit time=0.840 seconds
Fit ARIMA: order=(8, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11624.964, BIC=11671.521, Fit time=2.019 seconds
Fit ARIMA: order=(8, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=11628.183, BIC=11678.973, Fit time=2.600 seconds
Fit ARIMA: order=(9, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11626.009, BIC=11672.565, Fit time=1.045 seconds
Fit ARIMA: order=(9, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=11623.963, BIC=11674.752, Fit time=2.498 seconds
Fit ARIMA: order=(10, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=11632.552, BIC=11683.342, Fit time=1.248 seconds
Total fit time: 91.976 seconds
Out[0]:
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(4, 1, 5),
      out_of_sample_size=0, scoring='mse', scoring_args={},
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      suppress_warnings=True, transparams=True, trend='c')

The prediction is wrong as the seasonality was not removed (But was just trying to learn)

In [0]:
import matplotlib.pyplot as plt
forecast = model_auto_arima.predict(n_periods=len(test_data))
forecast = pd.DataFrame(forecast,index = test_data.index,columns=['Prediction'])
plt.figure(figsize=(20,6))
plt.title('Prediction of Pickups', fontsize=20)
plt.plot(train_data, label='Train')
plt.plot(test_data, label='Test')
plt.plot(forecast, label='Prediction using ARIMA Model')
plt.legend(loc='best')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Passenger Pickups', fontsize=14)
plt.show()
Mean Absolute percentage error = ((actual-predicted)/actual)*100
In [0]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
import math
print('R2 Score of ARIMA: ', r2_score(test_data, forecast))
print('Mean Squared Error (MSE) of ARIMA: ', mean_squared_error(test_data, forecast))
print('Root Mean Squared Error (RMSE) of ARIMA: ', math.sqrt(mean_squared_error(test_data, forecast)))
print('Mean Absolute Error (MAE) of ARIMA: ', mean_absolute_error(test_data, forecast))
R2 Score of ARIMA:  -0.9865317049196896
Mean Squared Error (MSE) of ARIMA:  2181436834.565685
Root Mean Squared Error (RMSE) of ARIMA:  46705.8543928455
Mean Absolute Error (MAE) of ARIMA:  38589.421099422594

The error is large because the seasonality was not removed.So now the seasonality will be removed.

In [0]:
# Fitting the Holt-Winters method for Weekly Sales.
from statsmodels.tsa.api import ExponentialSmoothing
model_holt_winters = ExponentialSmoothing(train_data, seasonal_periods=24, seasonal='additive' ).fit() 
pred = model_holt_winters.forecast(len(test_data))# Predict the test data
#Visualize train, test and predicted data.
plt.figure(figsize=(20,6))
plt.title('Prediction of Daily Pickups using Holt-Winters model', fontsize=20)
plt.plot(train_data, label='Train')
plt.plot(test_data, label='Test')
plt.plot(pred, label='Prediction using Holt Winters Methods')
plt.legend(loc='best')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Predicted Pickups', fontsize=14)
plt.show()
/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/holtwinters.py:712: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

In [0]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
import math
holtwinterr2=r2_score(test_data, pred)
holtwinterMSE=mean_squared_error(test_data, pred)
holtwinterRMSE=math.sqrt(mean_squared_error(test_data, pred))
holtwinterMAE= mean_absolute_error(test_data, pred)
print('R2 Score Holt Winter: ', r2_score(test_data, pred))
print('Mean Squared Error (MSE) of Holt Winter: ', mean_squared_error(test_data, pred))
print('Root Mean Squared Error (RMSE) of Holt Winter: ', math.sqrt(mean_squared_error(test_data, pred)))
print('Mean Absolute Error (MAE) of Holt Winter: ', mean_absolute_error(test_data, pred))
R2 Score Holt Winter:  -0.10768978627391967
Mean Squared Error (MSE) of Holt Winter:  1216368857.8772552
Root Mean Squared Error (RMSE) of Holt Winter:  34876.48000984697
Mean Absolute Error (MAE) of Holt Winter:  28822.247778329813
In [0]:
ts=data.groupby(["time"])["pickups"].sum()

Getting rolling mean and rolling sd in the time series data

In [0]:
plt.figure(figsize=(16,6))
plt.plot(ts.rolling(window=12,center=False).mean(),label='Rolling Mean');
plt.plot(ts.rolling(window=12,center=False).std(),label='Rolling Standard Deviation');
plt.legend();

Get Trend and Seasaonality out of the Data</b>

In [0]:
#Getting Seasonailty, Trend and Observed value using multiplicative model
import statsmodels.api as sm
# multiplicative
res = sm.tsa.seasonal_decompose(ts.values,freq=12,model="multiplicative")
fig = res.plot()
#fig.show()
In [0]:
# Additive model
res = sm.tsa.seasonal_decompose(ts.values,freq=12,model="additive")
#plt.figure(figsize=(16,12))
fig = res.plot()
#fig.show()

Stationary Time Series The observations in a stationary time series are not dependent on time.Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series are consistent over time, like the mean or the variance of the observations.The Augmented Dickey-Fuller test is a type of statistical test called a unit root test.The intuition behind a unit root test is that it determines how strongly a time series is defined by a trend.There are a number of unit root tests and the Augmented Dickey-Fuller may be one of the more widely used. It uses an autoregressive model and optimizes an information criterion across multiple different lag values.The null hypothesis of the test is that the time series can be represented by a unit root, that it is not stationary (has some time-dependent structure). The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary. Null Hypothesis (H0): If failed to be rejected, it suggests the time series has a unit root, meaning it is non-stationary. It has some time dependent structure. Alternate Hypothesis (H1): The null hypothesis is rejected; it suggests the time series does not have a unit root, meaning it is stationary. It does not have time-dependent structure. We interpret this result using the p-value from the test. A p-value below a threshold (such as 5% or 1%) suggests we reject the null hypothesis (stationary), otherwise a p-value above the threshold suggests we fail to reject the null hypothesis (non-stationary).
p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
p-value <= 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.

In [0]:
# Stationarity tests
#Performing Dickey-Fuller test:
def test_stationarity(timeseries):  
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

test_stationarity(ts)
Results of Dickey-Fuller Test:
Test Statistic                  -1.862083
p-value                          0.350103
#Lags Used                      19.000000
Number of Observations Used    710.000000
Critical Value (1%)             -3.439594
Critical Value (5%)             -2.865619
Critical Value (10%)            -2.568942
dtype: float64

p-value greater than 0.05 this means we have to remove trend and seasonality fromt he data.
Removing Trend and Seasonality out of the Data

In [0]:
# Remove trend
from pandas import Series as Series
# create a differenced series
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)

# invert differenced forecast
def inverse_difference(last_ob, value):
    return value + last_ob
In [0]:
ts=data.groupby(["time"])["pickups"].sum()
ts.astype('float')
plt.figure(figsize=(16,16))
plt.subplot(311)
plt.title('Original')
plt.xlabel('Time')
plt.ylabel('Pickups')
plt.plot(ts)
plt.subplot(312)
plt.title('After De-trend')
plt.xlabel('Time')
plt.ylabel('Pickups')
new_ts=difference(ts)
plt.plot(new_ts)
plt.plot()
plt.subplot(313)
plt.title('After De-seasonalization')
plt.xlabel('Time')
plt.ylabel('Pickups')
new_ts=difference(ts,12)       # assuming the seasonality is 12 months long
plt.plot(new_ts)
plt.plot()
Out[0]:
[]
In [0]:
import matplotlib.pyplot as plt
train_data = new_ts[:int(0.7*(len(data)))]
test_data = new_ts[int(0.7*(len(data))):]

The Data is now Stationary as p< 0.05

In [0]:
test_stationarity(new_ts)
Results of Dickey-Fuller Test:
Test Statistic                -7.981551e+00
p-value                        2.615200e-12
#Lags Used                     2.000000e+01
Number of Observations Used    6.970000e+02
Critical Value (1%)           -3.439767e+00
Critical Value (5%)           -2.865696e+00
Critical Value (10%)          -2.568983e+00
dtype: float64
In [0]:
# Applying auto_arima model on train data.
from pyramid.arima import auto_arima
model_auto_arima = auto_arima(train_data, trace=True, error_action='ignore', suppress_warnings=True)
model_auto_arima = auto_arima(train_data, trace=True,start_p=0, start_q=0, start_P=0, start_Q=0, max_p=10, max_q=10, max_P=10, max_Q=10, seasonal=True,stepwise=False, suppress_warnings=True, D=1, max_D=10,error_action='ignore',approximation = False)
model_auto_arima.fit(train_data)
Fit ARIMA: order=(2, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=12130.122, BIC=12155.528, Fit time=0.510 seconds
Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12405.255, BIC=12413.724, Fit time=0.025 seconds
Fit ARIMA: order=(1, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12178.895, BIC=12191.598, Fit time=0.066 seconds
Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12148.436, BIC=12161.139, Fit time=0.058 seconds
Fit ARIMA: order=(1, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=12128.214, BIC=12149.386, Fit time=0.217 seconds
Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12124.395, BIC=12141.333, Fit time=0.107 seconds
Fit ARIMA: order=(2, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12124.936, BIC=12146.108, Fit time=0.114 seconds
Total fit time: 1.104 seconds
Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12405.255, BIC=12413.724, Fit time=0.016 seconds
Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12148.436, BIC=12161.139, Fit time=0.051 seconds
Fit ARIMA: order=(0, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=12126.437, BIC=12143.375, Fit time=0.079 seconds
Fit ARIMA: order=(0, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=12126.476, BIC=12147.648, Fit time=0.177 seconds
Fit ARIMA: order=(0, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=12131.337, BIC=12156.744, Fit time=0.316 seconds
Fit ARIMA: order=(0, 0, 5) seasonal_order=(0, 0, 0, 1); AIC=12108.668, BIC=12138.309, Fit time=0.650 seconds
Fit ARIMA: order=(0, 0, 6) seasonal_order=(0, 0, 0, 1); AIC=12106.052, BIC=12139.927, Fit time=0.601 seconds
Fit ARIMA: order=(0, 0, 7) seasonal_order=(0, 0, 0, 1); AIC=12033.524, BIC=12071.633, Fit time=2.319 seconds
Fit ARIMA: order=(0, 0, 8) seasonal_order=(0, 0, 0, 1); AIC=12033.141, BIC=12075.485, Fit time=3.208 seconds
Fit ARIMA: order=(0, 0, 9) seasonal_order=(0, 0, 0, 1); AIC=12028.756, BIC=12075.334, Fit time=3.953 seconds
Fit ARIMA: order=(0, 0, 10) seasonal_order=(0, 0, 0, 1); AIC=12018.582, BIC=12069.395, Fit time=4.487 seconds
Fit ARIMA: order=(1, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12178.895, BIC=12191.598, Fit time=0.065 seconds
Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12124.395, BIC=12141.333, Fit time=0.077 seconds
Fit ARIMA: order=(1, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=12128.214, BIC=12149.386, Fit time=0.169 seconds
Fit ARIMA: order=(1, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=12126.575, BIC=12151.982, Fit time=0.396 seconds
Fit ARIMA: order=(1, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=12128.966, BIC=12158.606, Fit time=0.621 seconds
Fit ARIMA: order=(1, 0, 5) seasonal_order=(0, 0, 0, 1); AIC=12081.873, BIC=12115.749, Fit time=0.968 seconds
Fit ARIMA: order=(1, 0, 6) seasonal_order=(0, 0, 0, 1); AIC=12028.382, BIC=12066.491, Fit time=2.693 seconds
Fit ARIMA: order=(1, 0, 7) seasonal_order=(0, 0, 0, 1); AIC=11982.440, BIC=12024.784, Fit time=2.634 seconds
Fit ARIMA: order=(1, 0, 8) seasonal_order=(0, 0, 0, 1); AIC=11987.474, BIC=12034.052, Fit time=3.279 seconds
Fit ARIMA: order=(1, 0, 9) seasonal_order=(0, 0, 0, 1); AIC=11983.445, BIC=12034.258, Fit time=4.112 seconds
Fit ARIMA: order=(2, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12124.920, BIC=12141.857, Fit time=0.070 seconds
Fit ARIMA: order=(2, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12124.936, BIC=12146.108, Fit time=0.119 seconds
Fit ARIMA: order=(2, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=12130.122, BIC=12155.528, Fit time=0.419 seconds
Fit ARIMA: order=(2, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=12062.583, BIC=12092.223, Fit time=1.322 seconds
Fit ARIMA: order=(2, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=11962.904, BIC=11996.779, Fit time=1.410 seconds
Fit ARIMA: order=(2, 0, 5) seasonal_order=(0, 0, 0, 1); AIC=11955.742, BIC=11993.852, Fit time=1.829 seconds
Fit ARIMA: order=(2, 0, 6) seasonal_order=(0, 0, 0, 1); AIC=11981.645, BIC=12023.990, Fit time=2.668 seconds
Fit ARIMA: order=(2, 0, 7) seasonal_order=(0, 0, 0, 1); AIC=11951.476, BIC=11998.054, Fit time=2.790 seconds
Fit ARIMA: order=(2, 0, 8) seasonal_order=(0, 0, 0, 1); AIC=11952.344, BIC=12003.156, Fit time=3.563 seconds
Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12125.884, BIC=12147.056, Fit time=0.109 seconds
Fit ARIMA: order=(3, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12109.933, BIC=12135.340, Fit time=0.798 seconds
Fit ARIMA: order=(3, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=12130.517, BIC=12160.158, Fit time=0.374 seconds
Fit ARIMA: order=(3, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=11994.906, BIC=12028.781, Fit time=1.650 seconds
Fit ARIMA: order=(3, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=11952.174, BIC=11990.284, Fit time=1.933 seconds
Fit ARIMA: order=(3, 0, 5) seasonal_order=(0, 0, 0, 1); AIC=11915.683, BIC=11958.027, Fit time=2.331 seconds
Fit ARIMA: order=(3, 0, 6) seasonal_order=(0, 0, 0, 1); AIC=11980.156, BIC=12026.734, Fit time=2.771 seconds
Fit ARIMA: order=(3, 0, 7) seasonal_order=(0, 0, 0, 1); AIC=12041.197, BIC=12092.009, Fit time=3.470 seconds
Fit ARIMA: order=(4, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12122.744, BIC=12148.151, Fit time=0.168 seconds
Fit ARIMA: order=(4, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12123.029, BIC=12152.670, Fit time=0.362 seconds
Fit ARIMA: order=(4, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=11959.079, BIC=11992.954, Fit time=1.795 seconds
Fit ARIMA: order=(4, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=11961.906, BIC=12000.016, Fit time=1.643 seconds
Fit ARIMA: order=(4, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=11959.655, BIC=12001.999, Fit time=2.024 seconds
Fit ARIMA: order=(4, 0, 5) seasonal_order=(0, 0, 0, 1); AIC=12022.888, BIC=12069.467, Fit time=2.281 seconds
Fit ARIMA: order=(4, 0, 6) seasonal_order=(0, 0, 0, 1); AIC=11938.727, BIC=11989.540, Fit time=3.150 seconds
Fit ARIMA: order=(5, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12109.227, BIC=12138.868, Fit time=0.220 seconds
Fit ARIMA: order=(5, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12079.144, BIC=12113.020, Fit time=0.890 seconds
Fit ARIMA: order=(5, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=11986.393, BIC=12024.502, Fit time=1.902 seconds
Fit ARIMA: order=(5, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=11899.943, BIC=11942.287, Fit time=2.290 seconds
Fit ARIMA: order=(5, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=11827.301, BIC=11873.880, Fit time=2.117 seconds
Fit ARIMA: order=(5, 0, 5) seasonal_order=(0, 0, 0, 1); AIC=11973.559, BIC=12024.372, Fit time=2.833 seconds
Fit ARIMA: order=(6, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12018.832, BIC=12052.707, Fit time=0.483 seconds
Fit ARIMA: order=(6, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=12018.797, BIC=12056.907, Fit time=0.640 seconds
Fit ARIMA: order=(6, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=11923.148, BIC=11965.492, Fit time=2.317 seconds
Fit ARIMA: order=(6, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=11894.661, BIC=11941.240, Fit time=2.574 seconds
Fit ARIMA: order=(6, 0, 4) seasonal_order=(0, 0, 0, 1); AIC=11819.733, BIC=11870.546, Fit time=2.630 seconds
Fit ARIMA: order=(7, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=12015.363, BIC=12053.473, Fit time=0.577 seconds
Fit ARIMA: order=(7, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=11990.842, BIC=12033.186, Fit time=1.998 seconds
Fit ARIMA: order=(7, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=11900.256, BIC=11946.835, Fit time=2.569 seconds
Fit ARIMA: order=(7, 0, 3) seasonal_order=(0, 0, 0, 1); AIC=11910.215, BIC=11961.028, Fit time=3.000 seconds
Fit ARIMA: order=(8, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=11931.092, BIC=11973.436, Fit time=1.111 seconds
Fit ARIMA: order=(8, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=11888.000, BIC=11934.579, Fit time=2.086 seconds
Fit ARIMA: order=(8, 0, 2) seasonal_order=(0, 0, 0, 1); AIC=11887.471, BIC=11938.284, Fit time=2.203 seconds
Fit ARIMA: order=(9, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=11887.263, BIC=11933.841, Fit time=1.597 seconds
Fit ARIMA: order=(9, 0, 1) seasonal_order=(0, 0, 0, 1); AIC=11888.109, BIC=11938.922, Fit time=2.263 seconds
Fit ARIMA: order=(10, 0, 0) seasonal_order=(0, 0, 0, 1); AIC=11888.853, BIC=11939.666, Fit time=2.271 seconds
Total fit time: 108.542 seconds
Out[0]:
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(6, 0, 4),
      out_of_sample_size=0, scoring='mse', scoring_args={},
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      suppress_warnings=True, transparams=True, trend='c')
In [0]:
import matplotlib.pyplot as plt
forecast = model_auto_arima.predict(n_periods=len(test_data))
forecast = pd.DataFrame(forecast,index = test_data.index,columns=['Prediction'])
plt.figure(figsize=(20,6))
plt.title('Prediction of Pickups', fontsize=20)
plt.plot(train_data, label='Train')
plt.plot(test_data, label='Test')
plt.plot(forecast, label='Prediction using ARIMA Model')
plt.legend(loc='best')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Passenger Pickups', fontsize=14)
plt.show()
In [0]:
# Adding Trend Back to a Stationary Data
from pandas import Series as Series
# create a differenced series
def add(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] + dataset[i - interval]
        diff.append(value)
    return Series(diff)
In [0]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
import math
automarimaR2=r2_score(test_data, forecast)
autoarimaMSE=mean_squared_error(test_data, forecast)
autoarimaRMSE=math.sqrt(mean_squared_error(test_data, forecast))
autoarimaMAE=mean_absolute_error(test_data, forecast)
print('R2 Score of ARIMA: ', r2_score(test_data, forecast))
print('Mean Squared Error (MSE) of ARIMA: ', mean_squared_error(test_data, forecast))
print('Root Mean Squared Error (RMSE) of ARIMA: ', math.sqrt(mean_squared_error(test_data, forecast)))
print('Mean Absolute Error (MAD) of ARIMA: ', mean_absolute_error(test_data, forecast))

Now I will be calulating MAE for other techniques as well and then select the best model. Now I will be using Prophet and seeing the result.</p>

In [0]:
from fbprophet import Prophet

Now the p value is less than 0.05. This means now the value of the we need to decide the values of p and q ?

In [0]:
def tsplot(y, lags=None, figsize=(10, 8), style='bmh',title=''):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        #mpl.rcParams['font.family'] = 'Ubuntu Mono'
        layout = (3, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        qq_ax = plt.subplot2grid(layout, (2, 0))
        pp_ax = plt.subplot2grid(layout, (2, 1))
        
        y.plot(ax=ts_ax)
        ts_ax.set_title(title)
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)
        sm.qqplot(y, line='s', ax=qq_ax)
        qq_ax.set_title('QQ Plot')        
        scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)

        plt.tight_layout()
    return
In [0]:
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf,arma_order_select_ic
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
In [0]:
ts=data.groupby(["time"])["pickups"].sum()
ts.index=pd.date_range(start = '2018-01-01',end='2019-12-31', freq = 'D')
ts=ts.reset_index()
In [0]:
model = Prophet( yearly_seasonality=True)
In [0]:
ts.columns=['ds','y']
 #instantiate Prophet with only yearly seasonality as our data is monthly 
model.fit(ts) #fit the model with your dataframe
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[0]:
<fbprophet.forecaster.Prophet at 0x7f03d5b2da58>
In [0]:
future = model.make_future_dataframe(periods = 365, freq = 'D')  
# now lets make the forecasts
forecast = model.predict(future)
In [0]:
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[0]:
ds yhat yhat_lower yhat_upper
1090 2020-12-26 141432.467516 115575.865950 167886.709875
1091 2020-12-27 102642.386166 77027.079382 129441.792365
1092 2020-12-28 108557.593977 82450.742970 136210.817115
1093 2020-12-29 128098.345681 102360.380710 154839.540949
1094 2020-12-30 132913.154434 105184.448712 159022.102325
In [0]:
#Forecasted result
model.plot(forecast)
Out[0]:
In [0]:
model.plot_components(forecast)
Out[0]:
In [0]:
from scipy.optimize import minimize
import statsmodels.tsa.api as smt
import statsmodels.api as sm
from tqdm import tqdm_notebook
from itertools import product

Trying Simple Mocving Average

In [0]:
def plot_moving_average(series, window, plot_intervals=False, scale=1.96):

    rolling_mean = series.rolling(window=window).mean()
    
    plt.figure(figsize=(17,8))
    plt.title('Moving average\n window size = {}'.format(window))
    plt.plot(rolling_mean, 'g', label='Rolling mean trend')
    
    if plot_intervals:
        mae = mean_absolute_error(series[window:], rolling_mean[window:])
        deviation = np.std(series[window:] - rolling_mean[window:])
        lower_bound = rolling_mean - (mae + scale * deviation)
        upper_bound = rolling_mean + (mae + scale * deviation)
        plt.plot(upper_bound, 'r--', label='Upper bound / Lower bound')
        plt.plot(lower_bound, 'r--')
            
    plt.plot(series[window:], label='Actual values')
    plt.legend(loc='best')
    plt.grid(True)
    
#Smooth by the previous 5 days (by week)
plot_moving_average(data.pickups, 5)

#Smooth by the previous month (30 days)
plot_moving_average(data.pickups, 30)

#Smooth by previous quarter (90 days)
plot_moving_average(data.pickups, 90, plot_intervals=True)
In [0]:
def exponential_smoothing(series, alpha):

    result = [series[0]] # first value is same as series
    for n in range(1, len(series)):
        result.append(alpha * series[n] + (1 - alpha) * result[n-1])
    return result
  
def plot_exponential_smoothing(series, alphas):
 
    plt.figure(figsize=(17, 8))
    for alpha in alphas:
        plt.plot(exponential_smoothing(series, alpha), label="Alpha {}".format(alpha))
    plt.plot(series.values, "c", label = "Actual")
    plt.legend(loc="best")
    plt.axis('tight')
    plt.title("Exponential Smoothing")
    plt.grid(True);

plot_exponential_smoothing(data.pickups, [0.05, 0.3])
In [0]:
def double_exponential_smoothing(series, alpha, beta):

    result = [series[0]]
    for n in range(1, len(series)+20):
        if n == 1:
            level, trend = series[0], series[1] - series[0]
        if n >= len(series): # forecasting
            value = result[-1]
        else:
            value = series[n]
        last_level, level = level, alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        result.append(level + trend)
    return result

def plot_double_exponential_smoothing(series, alphas, betas):
     
    plt.figure(figsize=(17, 8))
    for alpha in alphas:
        for beta in betas:
            plt.plot(double_exponential_smoothing(series, alpha, beta), label="Alpha {}, beta {}".format(alpha, beta))
    plt.plot(series.values, label = "Actual")
    plt.legend(loc="best")
    plt.axis('tight')
    plt.title("Double Exponential Smoothing")
    plt.grid(True)
    
plot_double_exponential_smoothing(data.pickups, alphas=[0.9, 0.02], betas=[0.9, 0.02])
In [0]:
pd.plotting.register_matplotlib_converters()
In [0]:
data_diff = data.pickups - data.pickups.shift(1)
In [0]:
tsplot(data_diff[1:], lags=30)
In [0]:
def optimizeSARIMA(parameters_list, d, D, s):
    results = []
    best_aic = float("inf")

    for param in tqdm_notebook(parameters_list):
        # we need try-except because on some combinations model fails to converge
        try:
            model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(param[0], d, param[1]), 
                                            seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
        except:
            continue
        aic = model.aic
        # saving best model, AIC and parameters
        if aic < best_aic:
            best_model = model
            best_aic = aic
            best_param = param
        results.append([param, model.aic])

    result_table = pd.DataFrame(results)
    result_table.columns = ['parameters', 'aic']
    # sorting in ascending order, the lower AIC is - the better
    result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
    
    return result_table
In [0]:
best_model = sm.tsa.statespace.SARIMAX(data.pickups, order=(1, d, 0),
                                       seasonal_order=(1, 2, 0, s)).fit(disp=-1)

print(best_model.summary())
                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                            pickups   No. Observations:                  730
Model:             SARIMAX(1, 1, 0)x(1, 2, 0, 24)   Log Likelihood               -8353.942
Date:                            Wed, 01 Apr 2020   AIC                          16713.884
Time:                                    04:40:34   BIC                          16727.455
Sample:                                01-01-2018   HQIC                         16719.137
                                     - 12-31-2019                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.0531      0.078      0.685      0.493      -0.099       0.205
ar.S.L24      -0.7589      0.051    -14.973      0.000      -0.858      -0.660
sigma2      4.711e+09   6.56e-12   7.18e+20      0.000    4.71e+09    4.71e+09
===================================================================================
Ljung-Box (Q):                     1519.12   Jarque-Bera (JB):               163.69
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.58   Skew:                             0.25
Prob(H) (two-sided):                  0.00   Kurtosis:                         5.35
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 4.31e+35. Standard errors may be unstable.
In [0]:
tsplot(best_model.resid[24+1:])
In [0]:
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
In [0]:
def plotSARIMA(series, model, n_steps):
    """
        Plots model vs predicted values
        
        series - dataset with timeseries
        model - fitted SARIMA model
        n_steps - number of steps to predict in the future
        
    """
    # adding model values
    data = series.copy()
    data.columns = ['actual']
    data['arima_model'] = model.fittedvalues
    # making a shift on s+d steps, because these values were unobserved by the model
    # due to the differentiating
    data['arima_model'][:s+d] = np.NaN 
    # forecasting on n_steps forward 
    forecast = model.predict(start = data.shape[0], end = data.shape[0]+n_steps)
    forecast = data.arima_model.append(forecast)
    # calculate error, again having shifted on s+d steps from the beginning
    error = mean_absolute_percentage_error(data['actual'][s+d:], data['arima_model'][s+d:])+
    plt.figure(figsize=(15, 7))
    plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
    plt.plot(forecast, color='r', label="model")
    plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
    plt.plot(data.actual, label="actual")
    plt.legend()
    plt.grid(True);
In [0]:
series=data
model=best_model
n_steps=0
# adding model values
data = series.copy()
data.columns = ['actual']
data['arima_model'] = model.fittedvalues
data['arima_model'][:s+d] = np.NaN 
forecast = model.predict(start = data.shape[0], end = data.shape[0]+n_steps)
forecast = data.arima_model.append(forecast)
error = mean_absolute_percentage_error(data['actual'][s+d:], data['arima_model'][s+d:])
automarimaR2=r2_score(data['actual'][s+d:], data['arima_model'][s+d:])
autoarimaMSE=mean_squared_error(data['actual'][s+d:], data['arima_model'][s+d:])
autoarimaRMSE=math.sqrt(mean_squared_error(data['actual'][s+d:], data['arima_model'][s+d:]))
autoarimaMAE=mean_absolute_error(data['actual'][s+d:], data['arima_model'][s+d:])
print('R2 Score of ARIMA: ', r2_score(data['actual'][s+d:], data['arima_model'][s+d:]))
print('Mean Squared Error (MSE) of ARIMA: ', mean_squared_error(data['actual'][s+d:], data['arima_model'][s+d:]))
print('Root Mean Squared Error (RMSE) of ARIMA: ', math.sqrt(mean_squared_error(data['actual'][s+d:], data['arima_model'][s+d:])))
print('Mean Absolute Error (MAD) of ARIMA: ',mean_absolute_error(data['actual'][s+d:], data['arima_model'][s+d:]))
In [0]:
print("MAPE ofS ARIMA",error)
print('R2 Score of ARIMA: ', r2_score(data['actual'][s+d:], data['arima_model'][s+d:]))
print('Mean Squared Error (MSE) of ARIMA: ', mean_squared_error(data['actual'][s+d:], data['arima_model'][s+d:]))
print('Root Mean Squared Error (RMSE) of ARIMA: ', math.sqrt(mean_squared_error(data['actual'][s+d:], data['arima_model'][s+d:])))
print('Mean Absolute Error (MAD) of ARIMA: ',mean_absolute_error(data['actual'][s+d:], data['arima_model'][s+d:]))
MAPE 14.440326296404796
R2 Score of ARIMA:  -0.16877484726477388
Mean Squared Error (MSE) of ARIMA:  2110525331.414948
Root Mean Squared Error (RMSE) of ARIMA:  45940.45419251912
Mean Absolute Error (MAD) of ARIMA:  35060.62482704164
In [0]:
plotSARIMA(data, best_model, 0)

So i obtained the best time series forecasting model using SARIMA and by optimising it MAPE was 14.44 %. Now I will try other regression techniques and compare the MAPE error in all cases and will consider the one with the minimum MAPE error rate.

In [0]:
query = """
with wd as (
    SELECT 
        cast(year as STRING) as year,
        EXTRACT (DAYOFYEAR FROM CAST(CONCAT(year,'-',mo,'-',da) AS TIMESTAMP)) AS daynumber, 
        MIN(EXTRACT (DAYOFWEEK FROM CAST(CONCAT(year,'-',mo,'-',da) AS TIMESTAMP))) dayofweek,
        MIN(temp) temp,MIN(min) mintemp, MAX(max) maxtemp, MAX(IF(prcp=99.99,0,prcp)) rain
    FROM `bigquery-public-data.noaa_gsod.gsod2019`
    WHERE stn='725030'   --station id 725030=LaGuardia
    GROUP BY 1,2  
  -- TAXI DATA
  ),
TD AS(
SELECT  zone_id, time, pickups ,CAST(EXTRACT (YEAR from time) AS STRING) AS year,EXTRACT (DAYOFYEAR from time) AS daynumber,EXTRACT (HOUR from time) AS hour from `hello.CombinedData` ),points AS
(SELECT *,ST_CENTROID(zone_geom) as p FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`)
SELECT 
td.year,
points.zone_id,
td.time,
td.daynumber,
td.hour,
cast(wd.dayofweek as STRING) as dayofweek, 
wd.temp,
wd.mintemp, 
wd.maxtemp,
wd.rain,
td.pickups,
ST_X(p) as longitude,
ST_Y(p) as latitude
FROM wd, td INNER JOIN points ON points.zone_id=td.zone_id
where wd.year = td.year AND
wd.daynumber = td.daynumber
group by year,zone_id,time,daynumber,hour,pickups,longitude,latitude,dayofweek,temp, mintemp, maxtemp, rain
"""
df2 = client.query(query).to_dataframe()
In [0]:
df2.head()
Out[0]:
year zone_id daynumber hour dayofweek temp rain pickups
0 2019 42 109 2 6 61.8 0.00 7.0
1 2019 42 121 2 4 52.3 0.05 7.0
2 2019 42 122 2 5 54.6 0.01 7.0
3 2019 42 136 10 5 63.5 0.10 7.0
4 2019 42 139 6 1 65.8 0.00 7.0
In [0]:
df2.drop('time', axis=1, inplace=True)
In [0]:
df2["zone_id"] = df2["zone_id"].astype(int)
df2["daynumber"] = df2["daynumber"].astype(int)
df2["year"] = df2["year"].astype(int)
df2.drop('longitude', axis=1, inplace=True)
df2.drop('latitude', axis=1, inplace=True)

df2.drop('mintemp', axis=1, inplace=True)
df2.drop('maxtemp', axis=1, inplace=True)
df2["dayofweek"] = df2["dayofweek"].astype(int)
y=df2['pickups']
X=df2.drop('pickups', axis=1)

I will also perform cross validation and GridSearch CV to get the optimum result from all the model. One hot Encoding was done in model planning phase

Under fitting: the line covers all the points shown in the graph. Such model tend to cause underfitting of data .It also called High Bias. Over-fitting: shows the predicted line covers all the points in graph. In such condition you can also think that it’s a good graph which cover all the points. But that’s not actually true, the predicted line into the graph covers all points which are noise and outlier. Such model are also responsible to predict poor result due to its complexity.It is also called High Variance. Appropriate-Fitting: shows a pretty good predicted line. It covers majority of the point in graph and also maintains the balance between bias and variance.

In [0]:
#Splitting into train and test data set.
#I will also perform cross validation to get the optimum result from all the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
In [0]:
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.metrics import accuracy_score
import lightgbm as lgb

XGBoost is a powerful machine learning algorithm especially where speed and accuracy are concerned

In [0]:
reg = xgb.XGBRegressor(n_estimators=1000)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,
       verbose=True) # Change verbose to True if you want to see it train
[05:24:47] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:105.985	validation_1-rmse:105.265
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.

Will train until validation_1-rmse hasn't improved in 50 rounds.
[1]	validation_0-rmse:103.316	validation_1-rmse:102.601
[2]	validation_0-rmse:100.836	validation_1-rmse:100.22
[3]	validation_0-rmse:99.334	validation_1-rmse:98.7886
[4]	validation_0-rmse:97.1046	validation_1-rmse:96.6522
[5]	validation_0-rmse:95.5506	validation_1-rmse:95.1001
[6]	validation_0-rmse:94.5212	validation_1-rmse:94.0848
[7]	validation_0-rmse:93.0496	validation_1-rmse:92.6085
[8]	validation_0-rmse:92.2649	validation_1-rmse:91.8528
[9]	validation_0-rmse:91.1959	validation_1-rmse:90.8205
[10]	validation_0-rmse:90.481	validation_1-rmse:90.1765
[11]	validation_0-rmse:89.7247	validation_1-rmse:89.417
[12]	validation_0-rmse:89.0642	validation_1-rmse:88.6721
[13]	validation_0-rmse:88.2894	validation_1-rmse:87.9195
[14]	validation_0-rmse:87.8166	validation_1-rmse:87.3825
[15]	validation_0-rmse:87.1648	validation_1-rmse:86.7607
[16]	validation_0-rmse:86.0486	validation_1-rmse:85.7125
[17]	validation_0-rmse:85.6851	validation_1-rmse:85.3383
[18]	validation_0-rmse:85.2429	validation_1-rmse:84.8432
[19]	validation_0-rmse:84.8845	validation_1-rmse:84.4562
[20]	validation_0-rmse:84.5982	validation_1-rmse:84.1112
[21]	validation_0-rmse:84.3694	validation_1-rmse:83.8806
[22]	validation_0-rmse:84.0018	validation_1-rmse:83.5306
[23]	validation_0-rmse:83.1559	validation_1-rmse:82.714
[24]	validation_0-rmse:82.7753	validation_1-rmse:82.3737
[25]	validation_0-rmse:82.6107	validation_1-rmse:82.211
[26]	validation_0-rmse:82.3218	validation_1-rmse:81.8774
[27]	validation_0-rmse:82.1959	validation_1-rmse:81.7365
[28]	validation_0-rmse:81.9417	validation_1-rmse:81.5083
[29]	validation_0-rmse:81.6339	validation_1-rmse:81.1866
[30]	validation_0-rmse:81.3042	validation_1-rmse:80.8451
[31]	validation_0-rmse:81.0327	validation_1-rmse:80.589
[32]	validation_0-rmse:80.0547	validation_1-rmse:79.6366
[33]	validation_0-rmse:79.4941	validation_1-rmse:79.0754
[34]	validation_0-rmse:79.2234	validation_1-rmse:78.7989
[35]	validation_0-rmse:79.1191	validation_1-rmse:78.7024
[36]	validation_0-rmse:78.9493	validation_1-rmse:78.5355
[37]	validation_0-rmse:78.7243	validation_1-rmse:78.3309
[38]	validation_0-rmse:77.5846	validation_1-rmse:77.1995
[39]	validation_0-rmse:76.8717	validation_1-rmse:76.4891
[40]	validation_0-rmse:76.8048	validation_1-rmse:76.411
[41]	validation_0-rmse:76.6651	validation_1-rmse:76.2638
[42]	validation_0-rmse:76.1268	validation_1-rmse:75.727
[43]	validation_0-rmse:75.1409	validation_1-rmse:74.7717
[44]	validation_0-rmse:74.926	validation_1-rmse:74.5759
[45]	validation_0-rmse:74.8144	validation_1-rmse:74.471
[46]	validation_0-rmse:74.6937	validation_1-rmse:74.3479
[47]	validation_0-rmse:74.6101	validation_1-rmse:74.2682
[48]	validation_0-rmse:74.5624	validation_1-rmse:74.2133
[49]	validation_0-rmse:73.7647	validation_1-rmse:73.4385
[50]	validation_0-rmse:73.5122	validation_1-rmse:73.1908
[51]	validation_0-rmse:73.2014	validation_1-rmse:72.8705
[52]	validation_0-rmse:72.9743	validation_1-rmse:72.6629
[53]	validation_0-rmse:72.8182	validation_1-rmse:72.5076
[54]	validation_0-rmse:72.7862	validation_1-rmse:72.4639
[55]	validation_0-rmse:72.1998	validation_1-rmse:71.8813
[56]	validation_0-rmse:72.1145	validation_1-rmse:71.7905
[57]	validation_0-rmse:72.0637	validation_1-rmse:71.734
[58]	validation_0-rmse:71.4275	validation_1-rmse:71.1259
[59]	validation_0-rmse:71.3139	validation_1-rmse:71.009
[60]	validation_0-rmse:71.2489	validation_1-rmse:70.9451
[61]	validation_0-rmse:70.9719	validation_1-rmse:70.6631
[62]	validation_0-rmse:70.825	validation_1-rmse:70.5177
[63]	validation_0-rmse:70.7717	validation_1-rmse:70.4591
[64]	validation_0-rmse:70.2851	validation_1-rmse:69.9723
[65]	validation_0-rmse:69.7368	validation_1-rmse:69.4451
[66]	validation_0-rmse:69.252	validation_1-rmse:68.9892
[67]	validation_0-rmse:69.1551	validation_1-rmse:68.8935
[68]	validation_0-rmse:69.06	validation_1-rmse:68.7976
[69]	validation_0-rmse:68.9497	validation_1-rmse:68.6857
[70]	validation_0-rmse:68.7613	validation_1-rmse:68.4974
[71]	validation_0-rmse:68.3634	validation_1-rmse:68.1205
[72]	validation_0-rmse:68.2918	validation_1-rmse:68.0522
[73]	validation_0-rmse:68.1366	validation_1-rmse:67.8965
[74]	validation_0-rmse:68.0835	validation_1-rmse:67.8427
[75]	validation_0-rmse:68.0466	validation_1-rmse:67.8004
[76]	validation_0-rmse:67.6485	validation_1-rmse:67.4048
[77]	validation_0-rmse:67.4414	validation_1-rmse:67.2
[78]	validation_0-rmse:66.8402	validation_1-rmse:66.6288
[79]	validation_0-rmse:66.662	validation_1-rmse:66.446
[80]	validation_0-rmse:66.5512	validation_1-rmse:66.3362
[81]	validation_0-rmse:66.1865	validation_1-rmse:65.9639
[82]	validation_0-rmse:65.8647	validation_1-rmse:65.6466
[83]	validation_0-rmse:65.3697	validation_1-rmse:65.1693
[84]	validation_0-rmse:65.2606	validation_1-rmse:65.0631
[85]	validation_0-rmse:65.2337	validation_1-rmse:65.0348
[86]	validation_0-rmse:65.1531	validation_1-rmse:64.9621
[87]	validation_0-rmse:64.8448	validation_1-rmse:64.6654
[88]	validation_0-rmse:64.5026	validation_1-rmse:64.3208
[89]	validation_0-rmse:64.1246	validation_1-rmse:63.9649
[90]	validation_0-rmse:63.8244	validation_1-rmse:63.6599
[91]	validation_0-rmse:63.7235	validation_1-rmse:63.555
[92]	validation_0-rmse:63.4395	validation_1-rmse:63.2827
[93]	validation_0-rmse:63.2268	validation_1-rmse:63.0676
[94]	validation_0-rmse:63.0707	validation_1-rmse:62.9103
[95]	validation_0-rmse:63.0067	validation_1-rmse:62.844
[96]	validation_0-rmse:62.7582	validation_1-rmse:62.5926
[97]	validation_0-rmse:62.6082	validation_1-rmse:62.4408
[98]	validation_0-rmse:62.5636	validation_1-rmse:62.3934
[99]	validation_0-rmse:62.5202	validation_1-rmse:62.348
[100]	validation_0-rmse:62.2997	validation_1-rmse:62.1298
[101]	validation_0-rmse:62.2717	validation_1-rmse:62.1027
[102]	validation_0-rmse:62.1402	validation_1-rmse:61.9707
[103]	validation_0-rmse:61.7989	validation_1-rmse:61.6518
[104]	validation_0-rmse:61.761	validation_1-rmse:61.6118
[105]	validation_0-rmse:61.6896	validation_1-rmse:61.5397
[106]	validation_0-rmse:61.6711	validation_1-rmse:61.5208
[107]	validation_0-rmse:61.4839	validation_1-rmse:61.3322
[108]	validation_0-rmse:61.3871	validation_1-rmse:61.2322
[109]	validation_0-rmse:61.3201	validation_1-rmse:61.1686
[110]	validation_0-rmse:61.166	validation_1-rmse:61.0197
[111]	validation_0-rmse:60.886	validation_1-rmse:60.7541
[112]	validation_0-rmse:60.8715	validation_1-rmse:60.7351
[113]	validation_0-rmse:60.6073	validation_1-rmse:60.4714
[114]	validation_0-rmse:60.4178	validation_1-rmse:60.2782
[115]	validation_0-rmse:60.3857	validation_1-rmse:60.2449
[116]	validation_0-rmse:60.2049	validation_1-rmse:60.0734
[117]	validation_0-rmse:60.1675	validation_1-rmse:60.0349
[118]	validation_0-rmse:60.1164	validation_1-rmse:59.9847
[119]	validation_0-rmse:59.8961	validation_1-rmse:59.7631
[120]	validation_0-rmse:59.4965	validation_1-rmse:59.3618
[121]	validation_0-rmse:59.3257	validation_1-rmse:59.1916
[122]	validation_0-rmse:59.273	validation_1-rmse:59.1411
[123]	validation_0-rmse:59.2545	validation_1-rmse:59.124
[124]	validation_0-rmse:59.1001	validation_1-rmse:58.9805
[125]	validation_0-rmse:58.9779	validation_1-rmse:58.848
[126]	validation_0-rmse:58.7943	validation_1-rmse:58.6639
[127]	validation_0-rmse:58.5886	validation_1-rmse:58.4766
[128]	validation_0-rmse:58.4361	validation_1-rmse:58.3215
[129]	validation_0-rmse:58.393	validation_1-rmse:58.2758
[130]	validation_0-rmse:58.2748	validation_1-rmse:58.1664
[131]	validation_0-rmse:58.1605	validation_1-rmse:58.0486
[132]	validation_0-rmse:58.0697	validation_1-rmse:57.9555
[133]	validation_0-rmse:58.043	validation_1-rmse:57.9276
[134]	validation_0-rmse:58.0215	validation_1-rmse:57.9079
[135]	validation_0-rmse:57.8885	validation_1-rmse:57.7713
[136]	validation_0-rmse:57.8149	validation_1-rmse:57.6956
[137]	validation_0-rmse:57.5798	validation_1-rmse:57.4597
[138]	validation_0-rmse:57.4169	validation_1-rmse:57.3111
[139]	validation_0-rmse:57.2827	validation_1-rmse:57.1765
[140]	validation_0-rmse:57.1153	validation_1-rmse:57.0093
[141]	validation_0-rmse:57.0728	validation_1-rmse:56.9684
[142]	validation_0-rmse:56.9747	validation_1-rmse:56.8721
[143]	validation_0-rmse:56.873	validation_1-rmse:56.7684
[144]	validation_0-rmse:56.8572	validation_1-rmse:56.7526
[145]	validation_0-rmse:56.8467	validation_1-rmse:56.7414
[146]	validation_0-rmse:56.7105	validation_1-rmse:56.6065
[147]	validation_0-rmse:56.6984	validation_1-rmse:56.5964
[148]	validation_0-rmse:56.6759	validation_1-rmse:56.5736
[149]	validation_0-rmse:56.6149	validation_1-rmse:56.5137
[150]	validation_0-rmse:56.4844	validation_1-rmse:56.3845
[151]	validation_0-rmse:56.4486	validation_1-rmse:56.3472
[152]	validation_0-rmse:56.248	validation_1-rmse:56.1467
[153]	validation_0-rmse:56.0122	validation_1-rmse:55.9151
[154]	validation_0-rmse:55.92	validation_1-rmse:55.829
[155]	validation_0-rmse:55.8841	validation_1-rmse:55.7894
[156]	validation_0-rmse:55.7351	validation_1-rmse:55.6407
[157]	validation_0-rmse:55.7248	validation_1-rmse:55.6305
[158]	validation_0-rmse:55.5321	validation_1-rmse:55.4403
[159]	validation_0-rmse:55.3989	validation_1-rmse:55.3228
[160]	validation_0-rmse:55.3809	validation_1-rmse:55.3036
[161]	validation_0-rmse:55.3654	validation_1-rmse:55.2887
[162]	validation_0-rmse:55.1915	validation_1-rmse:55.1127
[163]	validation_0-rmse:55.0745	validation_1-rmse:54.9898
[164]	validation_0-rmse:54.9178	validation_1-rmse:54.8356
[165]	validation_0-rmse:54.6319	validation_1-rmse:54.5487
[166]	validation_0-rmse:54.5286	validation_1-rmse:54.4427
[167]	validation_0-rmse:54.5174	validation_1-rmse:54.433
[168]	validation_0-rmse:54.512	validation_1-rmse:54.4261
[169]	validation_0-rmse:54.2765	validation_1-rmse:54.1919
[170]	validation_0-rmse:54.157	validation_1-rmse:54.0736
[171]	validation_0-rmse:53.9555	validation_1-rmse:53.8801
[172]	validation_0-rmse:53.9121	validation_1-rmse:53.8339
[173]	validation_0-rmse:53.807	validation_1-rmse:53.7314
[174]	validation_0-rmse:53.7809	validation_1-rmse:53.7056
[175]	validation_0-rmse:53.7683	validation_1-rmse:53.6965
[176]	validation_0-rmse:53.6948	validation_1-rmse:53.629
[177]	validation_0-rmse:53.5659	validation_1-rmse:53.5005
[178]	validation_0-rmse:53.4367	validation_1-rmse:53.3714
[179]	validation_0-rmse:53.4094	validation_1-rmse:53.3442
[180]	validation_0-rmse:53.3268	validation_1-rmse:53.2638
[181]	validation_0-rmse:53.1382	validation_1-rmse:53.0747
[182]	validation_0-rmse:52.9859	validation_1-rmse:52.9159
[183]	validation_0-rmse:52.8202	validation_1-rmse:52.7519
[184]	validation_0-rmse:52.7466	validation_1-rmse:52.6764
[185]	validation_0-rmse:52.7401	validation_1-rmse:52.6696
[186]	validation_0-rmse:52.6813	validation_1-rmse:52.6099
[187]	validation_0-rmse:52.5744	validation_1-rmse:52.5044
[188]	validation_0-rmse:52.558	validation_1-rmse:52.4857
[189]	validation_0-rmse:52.5343	validation_1-rmse:52.4609
[190]	validation_0-rmse:52.5186	validation_1-rmse:52.4445
[191]	validation_0-rmse:52.5101	validation_1-rmse:52.4357
[192]	validation_0-rmse:52.4847	validation_1-rmse:52.4106
[193]	validation_0-rmse:52.3301	validation_1-rmse:52.2543
[194]	validation_0-rmse:52.2177	validation_1-rmse:52.1429
[195]	validation_0-rmse:52.1535	validation_1-rmse:52.0789
[196]	validation_0-rmse:52.1	validation_1-rmse:52.0266
[197]	validation_0-rmse:51.9688	validation_1-rmse:51.8879
[198]	validation_0-rmse:51.8256	validation_1-rmse:51.7457
[199]	validation_0-rmse:51.8122	validation_1-rmse:51.7303
[200]	validation_0-rmse:51.6446	validation_1-rmse:51.568
[201]	validation_0-rmse:51.5284	validation_1-rmse:51.4497
[202]	validation_0-rmse:51.4311	validation_1-rmse:51.3523
[203]	validation_0-rmse:51.3377	validation_1-rmse:51.2606
[204]	validation_0-rmse:51.3313	validation_1-rmse:51.254
[205]	validation_0-rmse:51.3175	validation_1-rmse:51.2408
[206]	validation_0-rmse:51.2089	validation_1-rmse:51.1363
[207]	validation_0-rmse:51.1881	validation_1-rmse:51.1162
[208]	validation_0-rmse:51.1653	validation_1-rmse:51.0929
[209]	validation_0-rmse:51.0267	validation_1-rmse:50.9597
[210]	validation_0-rmse:50.9408	validation_1-rmse:50.8748
[211]	validation_0-rmse:50.772	validation_1-rmse:50.707
[212]	validation_0-rmse:50.6207	validation_1-rmse:50.5559
[213]	validation_0-rmse:50.5651	validation_1-rmse:50.5012
[214]	validation_0-rmse:50.5374	validation_1-rmse:50.4732
[215]	validation_0-rmse:50.5321	validation_1-rmse:50.468
[216]	validation_0-rmse:50.3969	validation_1-rmse:50.3392
[217]	validation_0-rmse:50.2722	validation_1-rmse:50.2145
[218]	validation_0-rmse:50.1664	validation_1-rmse:50.105
[219]	validation_0-rmse:50.0746	validation_1-rmse:50.0187
[220]	validation_0-rmse:50.0702	validation_1-rmse:50.0139
[221]	validation_0-rmse:49.9675	validation_1-rmse:49.9123
[222]	validation_0-rmse:49.8291	validation_1-rmse:49.7754
[223]	validation_0-rmse:49.8218	validation_1-rmse:49.768
[224]	validation_0-rmse:49.7534	validation_1-rmse:49.7014
[225]	validation_0-rmse:49.6428	validation_1-rmse:49.5957
[226]	validation_0-rmse:49.5584	validation_1-rmse:49.5122
[227]	validation_0-rmse:49.5538	validation_1-rmse:49.5067
[228]	validation_0-rmse:49.5344	validation_1-rmse:49.4882
[229]	validation_0-rmse:49.4393	validation_1-rmse:49.393
[230]	validation_0-rmse:49.4376	validation_1-rmse:49.391
[231]	validation_0-rmse:49.4184	validation_1-rmse:49.3719
[232]	validation_0-rmse:49.3414	validation_1-rmse:49.2964
[233]	validation_0-rmse:49.2625	validation_1-rmse:49.2187
[234]	validation_0-rmse:49.2562	validation_1-rmse:49.2123
[235]	validation_0-rmse:49.2533	validation_1-rmse:49.2087
[236]	validation_0-rmse:49.1409	validation_1-rmse:49.0971
[237]	validation_0-rmse:49.0269	validation_1-rmse:48.9746
[238]	validation_0-rmse:48.9831	validation_1-rmse:48.9317
[239]	validation_0-rmse:48.9091	validation_1-rmse:48.8602
[240]	validation_0-rmse:48.8932	validation_1-rmse:48.8438
[241]	validation_0-rmse:48.8563	validation_1-rmse:48.8076
[242]	validation_0-rmse:48.7959	validation_1-rmse:48.749
[243]	validation_0-rmse:48.7121	validation_1-rmse:48.6709
[244]	validation_0-rmse:48.6997	validation_1-rmse:48.6572
[245]	validation_0-rmse:48.6464	validation_1-rmse:48.6049
[246]	validation_0-rmse:48.5773	validation_1-rmse:48.5384
[247]	validation_0-rmse:48.5005	validation_1-rmse:48.4646
[248]	validation_0-rmse:48.433	validation_1-rmse:48.3996
[249]	validation_0-rmse:48.3665	validation_1-rmse:48.3343
[250]	validation_0-rmse:48.3031	validation_1-rmse:48.2745
[251]	validation_0-rmse:48.2784	validation_1-rmse:48.2506
[252]	validation_0-rmse:48.2682	validation_1-rmse:48.2408
[253]	validation_0-rmse:48.2614	validation_1-rmse:48.2348
[254]	validation_0-rmse:48.1995	validation_1-rmse:48.1739
[255]	validation_0-rmse:48.197	validation_1-rmse:48.1707
[256]	validation_0-rmse:48.163	validation_1-rmse:48.1364
[257]	validation_0-rmse:48.1023	validation_1-rmse:48.0759
[258]	validation_0-rmse:48.0067	validation_1-rmse:47.9817
[259]	validation_0-rmse:48.0052	validation_1-rmse:47.9801
[260]	validation_0-rmse:47.967	validation_1-rmse:47.9406
[261]	validation_0-rmse:47.9125	validation_1-rmse:47.8871
[262]	validation_0-rmse:47.8447	validation_1-rmse:47.8202
[263]	validation_0-rmse:47.8135	validation_1-rmse:47.79
[264]	validation_0-rmse:47.7681	validation_1-rmse:47.7442
[265]	validation_0-rmse:47.639	validation_1-rmse:47.6133
[266]	validation_0-rmse:47.5869	validation_1-rmse:47.5597
[267]	validation_0-rmse:47.4921	validation_1-rmse:47.4677
[268]	validation_0-rmse:47.4831	validation_1-rmse:47.4586
[269]	validation_0-rmse:47.4374	validation_1-rmse:47.4149
[270]	validation_0-rmse:47.4109	validation_1-rmse:47.3861
[271]	validation_0-rmse:47.371	validation_1-rmse:47.3481
[272]	validation_0-rmse:47.3617	validation_1-rmse:47.3387
[273]	validation_0-rmse:47.3598	validation_1-rmse:47.3356
[274]	validation_0-rmse:47.3034	validation_1-rmse:47.2807
[275]	validation_0-rmse:47.2237	validation_1-rmse:47.1941
[276]	validation_0-rmse:47.152	validation_1-rmse:47.1292
[277]	validation_0-rmse:47.1479	validation_1-rmse:47.1246
[278]	validation_0-rmse:47.1286	validation_1-rmse:47.1054
[279]	validation_0-rmse:47.1264	validation_1-rmse:47.1028
[280]	validation_0-rmse:47.1251	validation_1-rmse:47.1014
[281]	validation_0-rmse:47.0721	validation_1-rmse:47.0476
[282]	validation_0-rmse:47.06	validation_1-rmse:47.035
[283]	validation_0-rmse:47.0266	validation_1-rmse:47.0015
[284]	validation_0-rmse:46.9723	validation_1-rmse:46.947
[285]	validation_0-rmse:46.9331	validation_1-rmse:46.9104
[286]	validation_0-rmse:46.8823	validation_1-rmse:46.8632
[287]	validation_0-rmse:46.8703	validation_1-rmse:46.8506
[288]	validation_0-rmse:46.8671	validation_1-rmse:46.8471
[289]	validation_0-rmse:46.8636	validation_1-rmse:46.8443
[290]	validation_0-rmse:46.8202	validation_1-rmse:46.8053
[291]	validation_0-rmse:46.7563	validation_1-rmse:46.7434
[292]	validation_0-rmse:46.6885	validation_1-rmse:46.674
[293]	validation_0-rmse:46.6732	validation_1-rmse:46.6591
[294]	validation_0-rmse:46.6132	validation_1-rmse:46.5974
[295]	validation_0-rmse:46.564	validation_1-rmse:46.5492
[296]	validation_0-rmse:46.556	validation_1-rmse:46.5409
[297]	validation_0-rmse:46.5008	validation_1-rmse:46.487
[298]	validation_0-rmse:46.4489	validation_1-rmse:46.4349
[299]	validation_0-rmse:46.3804	validation_1-rmse:46.358
[300]	validation_0-rmse:46.3114	validation_1-rmse:46.2854
[301]	validation_0-rmse:46.3005	validation_1-rmse:46.2737
[302]	validation_0-rmse:46.2905	validation_1-rmse:46.2634
[303]	validation_0-rmse:46.2199	validation_1-rmse:46.1928
[304]	validation_0-rmse:46.2086	validation_1-rmse:46.1821
[305]	validation_0-rmse:46.2077	validation_1-rmse:46.1807
[306]	validation_0-rmse:46.1601	validation_1-rmse:46.1318
[307]	validation_0-rmse:46.1537	validation_1-rmse:46.1249
[308]	validation_0-rmse:46.145	validation_1-rmse:46.116
[309]	validation_0-rmse:46.1419	validation_1-rmse:46.1126
[310]	validation_0-rmse:46.0843	validation_1-rmse:46.0559
[311]	validation_0-rmse:46.0258	validation_1-rmse:45.9993
[312]	validation_0-rmse:45.966	validation_1-rmse:45.942
[313]	validation_0-rmse:45.9229	validation_1-rmse:45.8994
[314]	validation_0-rmse:45.9205	validation_1-rmse:45.8962
[315]	validation_0-rmse:45.8519	validation_1-rmse:45.8316
[316]	validation_0-rmse:45.8157	validation_1-rmse:45.7976
[317]	validation_0-rmse:45.7715	validation_1-rmse:45.7533
[318]	validation_0-rmse:45.7599	validation_1-rmse:45.7421
[319]	validation_0-rmse:45.7497	validation_1-rmse:45.7316
[320]	validation_0-rmse:45.7137	validation_1-rmse:45.6951
[321]	validation_0-rmse:45.658	validation_1-rmse:45.6426
[322]	validation_0-rmse:45.6487	validation_1-rmse:45.6332
[323]	validation_0-rmse:45.6199	validation_1-rmse:45.6047
[324]	validation_0-rmse:45.6134	validation_1-rmse:45.5979
[325]	validation_0-rmse:45.5443	validation_1-rmse:45.5319
[326]	validation_0-rmse:45.5201	validation_1-rmse:45.5102
[327]	validation_0-rmse:45.4838	validation_1-rmse:45.4764
[328]	validation_0-rmse:45.45	validation_1-rmse:45.4402
[329]	validation_0-rmse:45.404	validation_1-rmse:45.3945
[330]	validation_0-rmse:45.3751	validation_1-rmse:45.3644
[331]	validation_0-rmse:45.3351	validation_1-rmse:45.3245
[332]	validation_0-rmse:45.279	validation_1-rmse:45.273
[333]	validation_0-rmse:45.2708	validation_1-rmse:45.2642
[334]	validation_0-rmse:45.2205	validation_1-rmse:45.2162
[335]	validation_0-rmse:45.1506	validation_1-rmse:45.1509
[336]	validation_0-rmse:45.1173	validation_1-rmse:45.1154
[337]	validation_0-rmse:45.1135	validation_1-rmse:45.1113
[338]	validation_0-rmse:45.0757	validation_1-rmse:45.0727
[339]	validation_0-rmse:45.0389	validation_1-rmse:45.0353
[340]	validation_0-rmse:44.9938	validation_1-rmse:44.9926
[341]	validation_0-rmse:44.9848	validation_1-rmse:44.9837
[342]	validation_0-rmse:44.9818	validation_1-rmse:44.9804
[343]	validation_0-rmse:44.9592	validation_1-rmse:44.9559
[344]	validation_0-rmse:44.9255	validation_1-rmse:44.9257
[345]	validation_0-rmse:44.9229	validation_1-rmse:44.9232
[346]	validation_0-rmse:44.9055	validation_1-rmse:44.9056
[347]	validation_0-rmse:44.9047	validation_1-rmse:44.9047
[348]	validation_0-rmse:44.8468	validation_1-rmse:44.8513
[349]	validation_0-rmse:44.8099	validation_1-rmse:44.8143
[350]	validation_0-rmse:44.7784	validation_1-rmse:44.7832
[351]	validation_0-rmse:44.7533	validation_1-rmse:44.7588
[352]	validation_0-rmse:44.7009	validation_1-rmse:44.7065
[353]	validation_0-rmse:44.6801	validation_1-rmse:44.6871
[354]	validation_0-rmse:44.6789	validation_1-rmse:44.6858
[355]	validation_0-rmse:44.6317	validation_1-rmse:44.6421
[356]	validation_0-rmse:44.6308	validation_1-rmse:44.6406
[357]	validation_0-rmse:44.608	validation_1-rmse:44.6176
[358]	validation_0-rmse:44.584	validation_1-rmse:44.593
[359]	validation_0-rmse:44.581	validation_1-rmse:44.5903
[360]	validation_0-rmse:44.5564	validation_1-rmse:44.5662
[361]	validation_0-rmse:44.5294	validation_1-rmse:44.5407
[362]	validation_0-rmse:44.493	validation_1-rmse:44.5043
[363]	validation_0-rmse:44.445	validation_1-rmse:44.4583
[364]	validation_0-rmse:44.3547	validation_1-rmse:44.37
[365]	validation_0-rmse:44.3231	validation_1-rmse:44.3385
[366]	validation_0-rmse:44.3208	validation_1-rmse:44.3357
[367]	validation_0-rmse:44.2856	validation_1-rmse:44.2998
[368]	validation_0-rmse:44.2795	validation_1-rmse:44.2941
[369]	validation_0-rmse:44.2788	validation_1-rmse:44.2933
[370]	validation_0-rmse:44.2401	validation_1-rmse:44.256
[371]	validation_0-rmse:44.1683	validation_1-rmse:44.1861
[372]	validation_0-rmse:44.1661	validation_1-rmse:44.1838
[373]	validation_0-rmse:44.1249	validation_1-rmse:44.1406
[374]	validation_0-rmse:44.1204	validation_1-rmse:44.1361
[375]	validation_0-rmse:44.0978	validation_1-rmse:44.1149
[376]	validation_0-rmse:44.0715	validation_1-rmse:44.0884
[377]	validation_0-rmse:44.07	validation_1-rmse:44.0866
[378]	validation_0-rmse:44.0688	validation_1-rmse:44.0855
[379]	validation_0-rmse:44.0669	validation_1-rmse:44.0838
[380]	validation_0-rmse:44.032	validation_1-rmse:44.0477
[381]	validation_0-rmse:44.0223	validation_1-rmse:44.0378
[382]	validation_0-rmse:44.0211	validation_1-rmse:44.0364
[383]	validation_0-rmse:43.9836	validation_1-rmse:44.0003
[384]	validation_0-rmse:43.9646	validation_1-rmse:43.9811
[385]	validation_0-rmse:43.9342	validation_1-rmse:43.9513
[386]	validation_0-rmse:43.8995	validation_1-rmse:43.9178
[387]	validation_0-rmse:43.8391	validation_1-rmse:43.8593
[388]	validation_0-rmse:43.8195	validation_1-rmse:43.8402
[389]	validation_0-rmse:43.7809	validation_1-rmse:43.8015
[390]	validation_0-rmse:43.7782	validation_1-rmse:43.7988
[391]	validation_0-rmse:43.7292	validation_1-rmse:43.7517
[392]	validation_0-rmse:43.7121	validation_1-rmse:43.7359
[393]	validation_0-rmse:43.702	validation_1-rmse:43.7265
[394]	validation_0-rmse:43.6659	validation_1-rmse:43.6891
[395]	validation_0-rmse:43.657	validation_1-rmse:43.6804
[396]	validation_0-rmse:43.6502	validation_1-rmse:43.6733
[397]	validation_0-rmse:43.6495	validation_1-rmse:43.6727
[398]	validation_0-rmse:43.635	validation_1-rmse:43.6592
[399]	validation_0-rmse:43.6135	validation_1-rmse:43.6382
[400]	validation_0-rmse:43.5651	validation_1-rmse:43.5896
[401]	validation_0-rmse:43.538	validation_1-rmse:43.5642
[402]	validation_0-rmse:43.5067	validation_1-rmse:43.5325
[403]	validation_0-rmse:43.4665	validation_1-rmse:43.4941
[404]	validation_0-rmse:43.4382	validation_1-rmse:43.4676
[405]	validation_0-rmse:43.406	validation_1-rmse:43.437
[406]	validation_0-rmse:43.3669	validation_1-rmse:43.4018
[407]	validation_0-rmse:43.3397	validation_1-rmse:43.3737
[408]	validation_0-rmse:43.3252	validation_1-rmse:43.3593
[409]	validation_0-rmse:43.2917	validation_1-rmse:43.3278
[410]	validation_0-rmse:43.2907	validation_1-rmse:43.327
[411]	validation_0-rmse:43.2637	validation_1-rmse:43.3001
[412]	validation_0-rmse:43.2575	validation_1-rmse:43.2941
[413]	validation_0-rmse:43.2125	validation_1-rmse:43.2502
[414]	validation_0-rmse:43.1935	validation_1-rmse:43.231
[415]	validation_0-rmse:43.1886	validation_1-rmse:43.2266
[416]	validation_0-rmse:43.1795	validation_1-rmse:43.2176
[417]	validation_0-rmse:43.1567	validation_1-rmse:43.195
[418]	validation_0-rmse:43.1527	validation_1-rmse:43.1907
[419]	validation_0-rmse:43.1254	validation_1-rmse:43.1624
[420]	validation_0-rmse:43.1152	validation_1-rmse:43.1517
[421]	validation_0-rmse:43.1036	validation_1-rmse:43.1412
[422]	validation_0-rmse:43.0983	validation_1-rmse:43.1352
[423]	validation_0-rmse:43.0952	validation_1-rmse:43.1323
[424]	validation_0-rmse:43.0944	validation_1-rmse:43.131
[425]	validation_0-rmse:43.0808	validation_1-rmse:43.1181
[426]	validation_0-rmse:43.0561	validation_1-rmse:43.0941
[427]	validation_0-rmse:43.0325	validation_1-rmse:43.0705
[428]	validation_0-rmse:43.0098	validation_1-rmse:43.0479
[429]	validation_0-rmse:42.9733	validation_1-rmse:43.0129
[430]	validation_0-rmse:42.9355	validation_1-rmse:42.9808
[431]	validation_0-rmse:42.914	validation_1-rmse:42.9588
[432]	validation_0-rmse:42.9129	validation_1-rmse:42.9577
[433]	validation_0-rmse:42.8938	validation_1-rmse:42.9389
[434]	validation_0-rmse:42.879	validation_1-rmse:42.9249
[435]	validation_0-rmse:42.854	validation_1-rmse:42.9006
[436]	validation_0-rmse:42.8233	validation_1-rmse:42.8697
[437]	validation_0-rmse:42.8075	validation_1-rmse:42.8537
[438]	validation_0-rmse:42.7879	validation_1-rmse:42.8348
[439]	validation_0-rmse:42.7858	validation_1-rmse:42.8332
[440]	validation_0-rmse:42.7665	validation_1-rmse:42.813
[441]	validation_0-rmse:42.7413	validation_1-rmse:42.7907
[442]	validation_0-rmse:42.7097	validation_1-rmse:42.7615
[443]	validation_0-rmse:42.6851	validation_1-rmse:42.7375
[444]	validation_0-rmse:42.6775	validation_1-rmse:42.7303
[445]	validation_0-rmse:42.623	validation_1-rmse:42.6781
[446]	validation_0-rmse:42.6163	validation_1-rmse:42.6717
[447]	validation_0-rmse:42.6123	validation_1-rmse:42.6676
[448]	validation_0-rmse:42.5576	validation_1-rmse:42.6146
[449]	validation_0-rmse:42.5396	validation_1-rmse:42.5964
[450]	validation_0-rmse:42.514	validation_1-rmse:42.5712
[451]	validation_0-rmse:42.4949	validation_1-rmse:42.5547
[452]	validation_0-rmse:42.4931	validation_1-rmse:42.5527
[453]	validation_0-rmse:42.4923	validation_1-rmse:42.552
[454]	validation_0-rmse:42.4912	validation_1-rmse:42.551
[455]	validation_0-rmse:42.4904	validation_1-rmse:42.5502
[456]	validation_0-rmse:42.4458	validation_1-rmse:42.5073
[457]	validation_0-rmse:42.4436	validation_1-rmse:42.5052
[458]	validation_0-rmse:42.4072	validation_1-rmse:42.4709
[459]	validation_0-rmse:42.3994	validation_1-rmse:42.4631
[460]	validation_0-rmse:42.3908	validation_1-rmse:42.4539
[461]	validation_0-rmse:42.3569	validation_1-rmse:42.4214
[462]	validation_0-rmse:42.354	validation_1-rmse:42.4184
[463]	validation_0-rmse:42.3486	validation_1-rmse:42.4134
[464]	validation_0-rmse:42.3467	validation_1-rmse:42.4118
[465]	validation_0-rmse:42.3219	validation_1-rmse:42.3909
[466]	validation_0-rmse:42.2921	validation_1-rmse:42.3621
[467]	validation_0-rmse:42.2913	validation_1-rmse:42.3613
[468]	validation_0-rmse:42.2752	validation_1-rmse:42.3441
[469]	validation_0-rmse:42.2689	validation_1-rmse:42.3373
[470]	validation_0-rmse:42.2683	validation_1-rmse:42.336
[471]	validation_0-rmse:42.2439	validation_1-rmse:42.3126
[472]	validation_0-rmse:42.2408	validation_1-rmse:42.3097
[473]	validation_0-rmse:42.2231	validation_1-rmse:42.2917
[474]	validation_0-rmse:42.2185	validation_1-rmse:42.2875
[475]	validation_0-rmse:42.1991	validation_1-rmse:42.2689
[476]	validation_0-rmse:42.1952	validation_1-rmse:42.2646
[477]	validation_0-rmse:42.1869	validation_1-rmse:42.2563
[478]	validation_0-rmse:42.1852	validation_1-rmse:42.2543
[479]	validation_0-rmse:42.1598	validation_1-rmse:42.2303
[480]	validation_0-rmse:42.1577	validation_1-rmse:42.2284
[481]	validation_0-rmse:42.143	validation_1-rmse:42.2145
[482]	validation_0-rmse:42.1311	validation_1-rmse:42.2022
[483]	validation_0-rmse:42.0968	validation_1-rmse:42.1689
[484]	validation_0-rmse:42.0839	validation_1-rmse:42.1557
[485]	validation_0-rmse:42.0801	validation_1-rmse:42.1525
[486]	validation_0-rmse:42.0367	validation_1-rmse:42.1088
[487]	validation_0-rmse:42.022	validation_1-rmse:42.0937
[488]	validation_0-rmse:42.0067	validation_1-rmse:42.0792
[489]	validation_0-rmse:42.0011	validation_1-rmse:42.0733
[490]	validation_0-rmse:41.9983	validation_1-rmse:42.0708
[491]	validation_0-rmse:41.9785	validation_1-rmse:42.0519
[492]	validation_0-rmse:41.9768	validation_1-rmse:42.0498
[493]	validation_0-rmse:41.9608	validation_1-rmse:42.0339
[494]	validation_0-rmse:41.9603	validation_1-rmse:42.0333
[495]	validation_0-rmse:41.9529	validation_1-rmse:42.0267
[496]	validation_0-rmse:41.9465	validation_1-rmse:42.0205
[497]	validation_0-rmse:41.92	validation_1-rmse:41.9939
[498]	validation_0-rmse:41.9036	validation_1-rmse:41.9783
[499]	validation_0-rmse:41.8886	validation_1-rmse:41.9636
[500]	validation_0-rmse:41.8677	validation_1-rmse:41.9404
[501]	validation_0-rmse:41.8543	validation_1-rmse:41.9274
[502]	validation_0-rmse:41.8536	validation_1-rmse:41.9268
[503]	validation_0-rmse:41.8506	validation_1-rmse:41.9242
[504]	validation_0-rmse:41.8463	validation_1-rmse:41.9199
[505]	validation_0-rmse:41.8247	validation_1-rmse:41.8988
[506]	validation_0-rmse:41.8239	validation_1-rmse:41.8981
[507]	validation_0-rmse:41.7886	validation_1-rmse:41.8623
[508]	validation_0-rmse:41.7866	validation_1-rmse:41.8606
[509]	validation_0-rmse:41.769	validation_1-rmse:41.8434
[510]	validation_0-rmse:41.7658	validation_1-rmse:41.8399
[511]	validation_0-rmse:41.755	validation_1-rmse:41.8295
[512]	validation_0-rmse:41.7523	validation_1-rmse:41.8272
[513]	validation_0-rmse:41.7189	validation_1-rmse:41.789
[514]	validation_0-rmse:41.7094	validation_1-rmse:41.7787
[515]	validation_0-rmse:41.7061	validation_1-rmse:41.7759
[516]	validation_0-rmse:41.7041	validation_1-rmse:41.774
[517]	validation_0-rmse:41.7031	validation_1-rmse:41.7731
[518]	validation_0-rmse:41.7013	validation_1-rmse:41.7716
[519]	validation_0-rmse:41.6832	validation_1-rmse:41.7541
[520]	validation_0-rmse:41.6824	validation_1-rmse:41.7533
[521]	validation_0-rmse:41.6803	validation_1-rmse:41.7512
[522]	validation_0-rmse:41.6774	validation_1-rmse:41.7481
[523]	validation_0-rmse:41.6759	validation_1-rmse:41.7468
[524]	validation_0-rmse:41.6754	validation_1-rmse:41.7463
[525]	validation_0-rmse:41.6711	validation_1-rmse:41.7419
[526]	validation_0-rmse:41.6693	validation_1-rmse:41.74
[527]	validation_0-rmse:41.6613	validation_1-rmse:41.7318
[528]	validation_0-rmse:41.5933	validation_1-rmse:41.6684
[529]	validation_0-rmse:41.5621	validation_1-rmse:41.6309
[530]	validation_0-rmse:41.5065	validation_1-rmse:41.5787
[531]	validation_0-rmse:41.4895	validation_1-rmse:41.563
[532]	validation_0-rmse:41.4619	validation_1-rmse:41.5311
[533]	validation_0-rmse:41.4255	validation_1-rmse:41.4944
[534]	validation_0-rmse:41.411	validation_1-rmse:41.4804
[535]	validation_0-rmse:41.3931	validation_1-rmse:41.463
[536]	validation_0-rmse:41.3811	validation_1-rmse:41.4509
[537]	validation_0-rmse:41.3553	validation_1-rmse:41.4266
[538]	validation_0-rmse:41.3254	validation_1-rmse:41.397
[539]	validation_0-rmse:41.3236	validation_1-rmse:41.3953
[540]	validation_0-rmse:41.3112	validation_1-rmse:41.3827
[541]	validation_0-rmse:41.2968	validation_1-rmse:41.3683
[542]	validation_0-rmse:41.2796	validation_1-rmse:41.3519
[543]	validation_0-rmse:41.2775	validation_1-rmse:41.3498
[544]	validation_0-rmse:41.2625	validation_1-rmse:41.3355
[545]	validation_0-rmse:41.2381	validation_1-rmse:41.3114
[546]	validation_0-rmse:41.2169	validation_1-rmse:41.2933
[547]	validation_0-rmse:41.2085	validation_1-rmse:41.2846
[548]	validation_0-rmse:41.1983	validation_1-rmse:41.2751
[549]	validation_0-rmse:41.167	validation_1-rmse:41.2384
[550]	validation_0-rmse:41.1586	validation_1-rmse:41.2299
[551]	validation_0-rmse:41.146	validation_1-rmse:41.2182
[552]	validation_0-rmse:41.1429	validation_1-rmse:41.2152
[553]	validation_0-rmse:41.1363	validation_1-rmse:41.2089
[554]	validation_0-rmse:41.1287	validation_1-rmse:41.2012
[555]	validation_0-rmse:41.121	validation_1-rmse:41.1941
[556]	validation_0-rmse:41.1087	validation_1-rmse:41.1814
[557]	validation_0-rmse:41.0396	validation_1-rmse:41.1102
[558]	validation_0-rmse:41.0141	validation_1-rmse:41.084
[559]	validation_0-rmse:41.0122	validation_1-rmse:41.0818
[560]	validation_0-rmse:41.0016	validation_1-rmse:41.0712
[561]	validation_0-rmse:40.9648	validation_1-rmse:41.0304
[562]	validation_0-rmse:40.9431	validation_1-rmse:41.0049
[563]	validation_0-rmse:40.8887	validation_1-rmse:40.9493
[564]	validation_0-rmse:40.8779	validation_1-rmse:40.9385
[565]	validation_0-rmse:40.8697	validation_1-rmse:40.9301
[566]	validation_0-rmse:40.8636	validation_1-rmse:40.9242
[567]	validation_0-rmse:40.8607	validation_1-rmse:40.9211
[568]	validation_0-rmse:40.8519	validation_1-rmse:40.912
[569]	validation_0-rmse:40.8463	validation_1-rmse:40.9067
[570]	validation_0-rmse:40.8032	validation_1-rmse:40.8641
[571]	validation_0-rmse:40.7932	validation_1-rmse:40.8545
[572]	validation_0-rmse:40.7912	validation_1-rmse:40.8524
[573]	validation_0-rmse:40.7591	validation_1-rmse:40.8224
[574]	validation_0-rmse:40.7524	validation_1-rmse:40.8159
[575]	validation_0-rmse:40.7299	validation_1-rmse:40.7925
[576]	validation_0-rmse:40.7258	validation_1-rmse:40.7888
[577]	validation_0-rmse:40.7253	validation_1-rmse:40.7883
[578]	validation_0-rmse:40.6504	validation_1-rmse:40.7137
[579]	validation_0-rmse:40.6317	validation_1-rmse:40.6969
[580]	validation_0-rmse:40.6166	validation_1-rmse:40.684
[581]	validation_0-rmse:40.6113	validation_1-rmse:40.6782
[582]	validation_0-rmse:40.6046	validation_1-rmse:40.6716
[583]	validation_0-rmse:40.5932	validation_1-rmse:40.6604
[584]	validation_0-rmse:40.5924	validation_1-rmse:40.6597
[585]	validation_0-rmse:40.5868	validation_1-rmse:40.6539
[586]	validation_0-rmse:40.5577	validation_1-rmse:40.62
[587]	validation_0-rmse:40.5555	validation_1-rmse:40.6179
[588]	validation_0-rmse:40.5522	validation_1-rmse:40.6149
[589]	validation_0-rmse:40.5438	validation_1-rmse:40.6068
[590]	validation_0-rmse:40.5225	validation_1-rmse:40.5823
[591]	validation_0-rmse:40.5129	validation_1-rmse:40.5732
[592]	validation_0-rmse:40.5112	validation_1-rmse:40.5714
[593]	validation_0-rmse:40.5093	validation_1-rmse:40.5696
[594]	validation_0-rmse:40.5076	validation_1-rmse:40.5675
[595]	validation_0-rmse:40.5066	validation_1-rmse:40.5666
[596]	validation_0-rmse:40.5056	validation_1-rmse:40.5654
[597]	validation_0-rmse:40.5042	validation_1-rmse:40.5638
[598]	validation_0-rmse:40.4982	validation_1-rmse:40.5582
[599]	validation_0-rmse:40.4968	validation_1-rmse:40.5565
[600]	validation_0-rmse:40.4962	validation_1-rmse:40.5561
[601]	validation_0-rmse:40.4783	validation_1-rmse:40.5347
[602]	validation_0-rmse:40.4656	validation_1-rmse:40.5236
[603]	validation_0-rmse:40.451	validation_1-rmse:40.509
[604]	validation_0-rmse:40.4464	validation_1-rmse:40.5055
[605]	validation_0-rmse:40.4153	validation_1-rmse:40.4771
[606]	validation_0-rmse:40.4104	validation_1-rmse:40.4727
[607]	validation_0-rmse:40.4068	validation_1-rmse:40.4701
[608]	validation_0-rmse:40.3978	validation_1-rmse:40.4609
[609]	validation_0-rmse:40.3903	validation_1-rmse:40.4535
[610]	validation_0-rmse:40.3724	validation_1-rmse:40.4349
[611]	validation_0-rmse:40.3708	validation_1-rmse:40.4336
[612]	validation_0-rmse:40.3695	validation_1-rmse:40.4321
[613]	validation_0-rmse:40.3662	validation_1-rmse:40.4288
[614]	validation_0-rmse:40.3644	validation_1-rmse:40.4269
[615]	validation_0-rmse:40.3571	validation_1-rmse:40.4196
[616]	validation_0-rmse:40.3362	validation_1-rmse:40.3989
[617]	validation_0-rmse:40.3255	validation_1-rmse:40.3884
[618]	validation_0-rmse:40.3162	validation_1-rmse:40.379
[619]	validation_0-rmse:40.3157	validation_1-rmse:40.3785
[620]	validation_0-rmse:40.3137	validation_1-rmse:40.3765
[621]	validation_0-rmse:40.3049	validation_1-rmse:40.3676
[622]	validation_0-rmse:40.3036	validation_1-rmse:40.3664
[623]	validation_0-rmse:40.2881	validation_1-rmse:40.3497
[624]	validation_0-rmse:40.2865	validation_1-rmse:40.3484
[625]	validation_0-rmse:40.2851	validation_1-rmse:40.3469
[626]	validation_0-rmse:40.2743	validation_1-rmse:40.3362
[627]	validation_0-rmse:40.2739	validation_1-rmse:40.3357
[628]	validation_0-rmse:40.273	validation_1-rmse:40.3348
[629]	validation_0-rmse:40.2719	validation_1-rmse:40.334
[630]	validation_0-rmse:40.2715	validation_1-rmse:40.3335
[631]	validation_0-rmse:40.2699	validation_1-rmse:40.332
[632]	validation_0-rmse:40.2619	validation_1-rmse:40.3241
[633]	validation_0-rmse:40.2602	validation_1-rmse:40.3225
[634]	validation_0-rmse:40.2593	validation_1-rmse:40.3211
[635]	validation_0-rmse:40.2525	validation_1-rmse:40.3143
[636]	validation_0-rmse:40.2502	validation_1-rmse:40.3119
[637]	validation_0-rmse:40.2488	validation_1-rmse:40.3108
[638]	validation_0-rmse:40.2472	validation_1-rmse:40.3091
[639]	validation_0-rmse:40.2398	validation_1-rmse:40.3015
[640]	validation_0-rmse:40.2357	validation_1-rmse:40.2971
[641]	validation_0-rmse:40.2286	validation_1-rmse:40.2896
[642]	validation_0-rmse:40.218	validation_1-rmse:40.2802
[643]	validation_0-rmse:40.2145	validation_1-rmse:40.2767
[644]	validation_0-rmse:40.206	validation_1-rmse:40.2698
[645]	validation_0-rmse:40.1981	validation_1-rmse:40.2616
[646]	validation_0-rmse:40.194	validation_1-rmse:40.2573
[647]	validation_0-rmse:40.1807	validation_1-rmse:40.2432
[648]	validation_0-rmse:40.1685	validation_1-rmse:40.2316
[649]	validation_0-rmse:40.1668	validation_1-rmse:40.23
[650]	validation_0-rmse:40.1506	validation_1-rmse:40.2155
[651]	validation_0-rmse:40.1493	validation_1-rmse:40.214
[652]	validation_0-rmse:40.1402	validation_1-rmse:40.2058
[653]	validation_0-rmse:40.1316	validation_1-rmse:40.1983
[654]	validation_0-rmse:40.1253	validation_1-rmse:40.1918
[655]	validation_0-rmse:40.1023	validation_1-rmse:40.1693
[656]	validation_0-rmse:40.0893	validation_1-rmse:40.1564
[657]	validation_0-rmse:40.0753	validation_1-rmse:40.1436
[658]	validation_0-rmse:40.0145	validation_1-rmse:40.0826
[659]	validation_0-rmse:40.0131	validation_1-rmse:40.0812
[660]	validation_0-rmse:40.0097	validation_1-rmse:40.0783
[661]	validation_0-rmse:40.0075	validation_1-rmse:40.0759
[662]	validation_0-rmse:40.0021	validation_1-rmse:40.0704
[663]	validation_0-rmse:39.9914	validation_1-rmse:40.06
[664]	validation_0-rmse:39.9812	validation_1-rmse:40.0499
[665]	validation_0-rmse:39.9564	validation_1-rmse:40.0245
[666]	validation_0-rmse:39.9559	validation_1-rmse:40.0242
[667]	validation_0-rmse:39.9356	validation_1-rmse:40.0033
[668]	validation_0-rmse:39.9352	validation_1-rmse:40.003
[669]	validation_0-rmse:39.9341	validation_1-rmse:40.0016
[670]	validation_0-rmse:39.9204	validation_1-rmse:39.9904
[671]	validation_0-rmse:39.9085	validation_1-rmse:39.9791
[672]	validation_0-rmse:39.8818	validation_1-rmse:39.9563
[673]	validation_0-rmse:39.8805	validation_1-rmse:39.9552
[674]	validation_0-rmse:39.8798	validation_1-rmse:39.9545
[675]	validation_0-rmse:39.879	validation_1-rmse:39.9538
[676]	validation_0-rmse:39.8782	validation_1-rmse:39.9534
[677]	validation_0-rmse:39.8726	validation_1-rmse:39.9479
[678]	validation_0-rmse:39.8712	validation_1-rmse:39.9467
[679]	validation_0-rmse:39.8708	validation_1-rmse:39.9463
[680]	validation_0-rmse:39.8525	validation_1-rmse:39.9238
[681]	validation_0-rmse:39.8338	validation_1-rmse:39.9071
[682]	validation_0-rmse:39.8326	validation_1-rmse:39.9056
[683]	validation_0-rmse:39.8247	validation_1-rmse:39.8976
[684]	validation_0-rmse:39.8241	validation_1-rmse:39.897
[685]	validation_0-rmse:39.8217	validation_1-rmse:39.894
[686]	validation_0-rmse:39.7961	validation_1-rmse:39.871
[687]	validation_0-rmse:39.7692	validation_1-rmse:39.8466
[688]	validation_0-rmse:39.7552	validation_1-rmse:39.8354
[689]	validation_0-rmse:39.7543	validation_1-rmse:39.8348
[690]	validation_0-rmse:39.7476	validation_1-rmse:39.828
[691]	validation_0-rmse:39.7377	validation_1-rmse:39.8174
[692]	validation_0-rmse:39.7376	validation_1-rmse:39.8174
[693]	validation_0-rmse:39.7342	validation_1-rmse:39.8138
[694]	validation_0-rmse:39.7322	validation_1-rmse:39.8117
[695]	validation_0-rmse:39.731	validation_1-rmse:39.811
[696]	validation_0-rmse:39.7301	validation_1-rmse:39.8102
[697]	validation_0-rmse:39.7291	validation_1-rmse:39.8092
[698]	validation_0-rmse:39.694	validation_1-rmse:39.7748
[699]	validation_0-rmse:39.6881	validation_1-rmse:39.7685
[700]	validation_0-rmse:39.6742	validation_1-rmse:39.7551
[701]	validation_0-rmse:39.667	validation_1-rmse:39.7476
[702]	validation_0-rmse:39.6452	validation_1-rmse:39.7257
[703]	validation_0-rmse:39.6392	validation_1-rmse:39.7195
[704]	validation_0-rmse:39.6207	validation_1-rmse:39.7013
[705]	validation_0-rmse:39.6083	validation_1-rmse:39.6899
[706]	validation_0-rmse:39.5796	validation_1-rmse:39.6614
[707]	validation_0-rmse:39.5709	validation_1-rmse:39.6522
[708]	validation_0-rmse:39.5621	validation_1-rmse:39.6436
[709]	validation_0-rmse:39.557	validation_1-rmse:39.6387
[710]	validation_0-rmse:39.5536	validation_1-rmse:39.6352
[711]	validation_0-rmse:39.5495	validation_1-rmse:39.6311
[712]	validation_0-rmse:39.5345	validation_1-rmse:39.6177
[713]	validation_0-rmse:39.5319	validation_1-rmse:39.6152
[714]	validation_0-rmse:39.5221	validation_1-rmse:39.6073
[715]	validation_0-rmse:39.5198	validation_1-rmse:39.6053
[716]	validation_0-rmse:39.5175	validation_1-rmse:39.603
[717]	validation_0-rmse:39.5095	validation_1-rmse:39.5953
[718]	validation_0-rmse:39.5086	validation_1-rmse:39.5946
[719]	validation_0-rmse:39.4934	validation_1-rmse:39.5795
[720]	validation_0-rmse:39.4765	validation_1-rmse:39.5621
[721]	validation_0-rmse:39.4716	validation_1-rmse:39.5565
[722]	validation_0-rmse:39.4325	validation_1-rmse:39.5196
[723]	validation_0-rmse:39.4321	validation_1-rmse:39.5193
[724]	validation_0-rmse:39.428	validation_1-rmse:39.5155
[725]	validation_0-rmse:39.4272	validation_1-rmse:39.5147
[726]	validation_0-rmse:39.426	validation_1-rmse:39.5133
[727]	validation_0-rmse:39.411	validation_1-rmse:39.4981
[728]	validation_0-rmse:39.4077	validation_1-rmse:39.4946
[729]	validation_0-rmse:39.3787	validation_1-rmse:39.4661
[730]	validation_0-rmse:39.378	validation_1-rmse:39.4655
[731]	validation_0-rmse:39.3776	validation_1-rmse:39.4652
[732]	validation_0-rmse:39.3633	validation_1-rmse:39.4509
[733]	validation_0-rmse:39.3624	validation_1-rmse:39.4499
[734]	validation_0-rmse:39.3579	validation_1-rmse:39.4445
[735]	validation_0-rmse:39.3516	validation_1-rmse:39.4389
[736]	validation_0-rmse:39.3435	validation_1-rmse:39.431
[737]	validation_0-rmse:39.3362	validation_1-rmse:39.4244
[738]	validation_0-rmse:39.3121	validation_1-rmse:39.4009
[739]	validation_0-rmse:39.3105	validation_1-rmse:39.399
[740]	validation_0-rmse:39.3004	validation_1-rmse:39.3891
[741]	validation_0-rmse:39.2984	validation_1-rmse:39.3875
[742]	validation_0-rmse:39.2847	validation_1-rmse:39.3731
[743]	validation_0-rmse:39.269	validation_1-rmse:39.3569
[744]	validation_0-rmse:39.2648	validation_1-rmse:39.3524
[745]	validation_0-rmse:39.2495	validation_1-rmse:39.3363
[746]	validation_0-rmse:39.2359	validation_1-rmse:39.3238
[747]	validation_0-rmse:39.2272	validation_1-rmse:39.3158
[748]	validation_0-rmse:39.2267	validation_1-rmse:39.3153
[749]	validation_0-rmse:39.2098	validation_1-rmse:39.2979
[750]	validation_0-rmse:39.1994	validation_1-rmse:39.2873
[751]	validation_0-rmse:39.198	validation_1-rmse:39.2862
[752]	validation_0-rmse:39.1895	validation_1-rmse:39.2772
[753]	validation_0-rmse:39.1811	validation_1-rmse:39.269
[754]	validation_0-rmse:39.1716	validation_1-rmse:39.2599
[755]	validation_0-rmse:39.1688	validation_1-rmse:39.2573
[756]	validation_0-rmse:39.1622	validation_1-rmse:39.2509
[757]	validation_0-rmse:39.1508	validation_1-rmse:39.2404
[758]	validation_0-rmse:39.143	validation_1-rmse:39.2334
[759]	validation_0-rmse:39.1406	validation_1-rmse:39.2311
[760]	validation_0-rmse:39.1393	validation_1-rmse:39.2296
[761]	validation_0-rmse:39.1388	validation_1-rmse:39.229
[762]	validation_0-rmse:39.1322	validation_1-rmse:39.2214
[763]	validation_0-rmse:39.1215	validation_1-rmse:39.2132
[764]	validation_0-rmse:39.1044	validation_1-rmse:39.1961
[765]	validation_0-rmse:39.1004	validation_1-rmse:39.1922
[766]	validation_0-rmse:39.0857	validation_1-rmse:39.1795
[767]	validation_0-rmse:39.061	validation_1-rmse:39.1565
[768]	validation_0-rmse:39.0556	validation_1-rmse:39.1512
[769]	validation_0-rmse:39.0461	validation_1-rmse:39.1396
[770]	validation_0-rmse:39.045	validation_1-rmse:39.1387
[771]	validation_0-rmse:39.0443	validation_1-rmse:39.1381
[772]	validation_0-rmse:39.0342	validation_1-rmse:39.1303
[773]	validation_0-rmse:39.0263	validation_1-rmse:39.1229
[774]	validation_0-rmse:39.0129	validation_1-rmse:39.1091
[775]	validation_0-rmse:39.0091	validation_1-rmse:39.1057
[776]	validation_0-rmse:38.9757	validation_1-rmse:39.0712
[777]	validation_0-rmse:38.9728	validation_1-rmse:39.0686
[778]	validation_0-rmse:38.9706	validation_1-rmse:39.0664
[779]	validation_0-rmse:38.9672	validation_1-rmse:39.0626
[780]	validation_0-rmse:38.9554	validation_1-rmse:39.0527
[781]	validation_0-rmse:38.9531	validation_1-rmse:39.0502
[782]	validation_0-rmse:38.9527	validation_1-rmse:39.0499
[783]	validation_0-rmse:38.9523	validation_1-rmse:39.0495
[784]	validation_0-rmse:38.9512	validation_1-rmse:39.0483
[785]	validation_0-rmse:38.9473	validation_1-rmse:39.0442
[786]	validation_0-rmse:38.9454	validation_1-rmse:39.0428
[787]	validation_0-rmse:38.9385	validation_1-rmse:39.0361
[788]	validation_0-rmse:38.9082	validation_1-rmse:39.0084
[789]	validation_0-rmse:38.9065	validation_1-rmse:39.0069
[790]	validation_0-rmse:38.904	validation_1-rmse:39.0039
[791]	validation_0-rmse:38.896	validation_1-rmse:38.9971
[792]	validation_0-rmse:38.8947	validation_1-rmse:38.9957
[793]	validation_0-rmse:38.8889	validation_1-rmse:38.99
[794]	validation_0-rmse:38.8674	validation_1-rmse:38.971
[795]	validation_0-rmse:38.8604	validation_1-rmse:38.9643
[796]	validation_0-rmse:38.8532	validation_1-rmse:38.9584
[797]	validation_0-rmse:38.849	validation_1-rmse:38.9541
[798]	validation_0-rmse:38.8484	validation_1-rmse:38.9535
[799]	validation_0-rmse:38.8378	validation_1-rmse:38.9426
[800]	validation_0-rmse:38.8246	validation_1-rmse:38.9292
[801]	validation_0-rmse:38.8231	validation_1-rmse:38.9278
[802]	validation_0-rmse:38.8174	validation_1-rmse:38.9224
[803]	validation_0-rmse:38.8029	validation_1-rmse:38.9095
[804]	validation_0-rmse:38.7974	validation_1-rmse:38.9039
[805]	validation_0-rmse:38.7912	validation_1-rmse:38.8991
[806]	validation_0-rmse:38.7894	validation_1-rmse:38.897
[807]	validation_0-rmse:38.7851	validation_1-rmse:38.8929
[808]	validation_0-rmse:38.7665	validation_1-rmse:38.8762
[809]	validation_0-rmse:38.7647	validation_1-rmse:38.8737
[810]	validation_0-rmse:38.7609	validation_1-rmse:38.87
[811]	validation_0-rmse:38.7604	validation_1-rmse:38.8697
[812]	validation_0-rmse:38.755	validation_1-rmse:38.8642
[813]	validation_0-rmse:38.754	validation_1-rmse:38.8631
[814]	validation_0-rmse:38.7334	validation_1-rmse:38.8432
[815]	validation_0-rmse:38.7141	validation_1-rmse:38.8269
[816]	validation_0-rmse:38.7121	validation_1-rmse:38.8244
[817]	validation_0-rmse:38.7067	validation_1-rmse:38.8184
[818]	validation_0-rmse:38.7052	validation_1-rmse:38.8174
[819]	validation_0-rmse:38.6955	validation_1-rmse:38.8094
[820]	validation_0-rmse:38.6935	validation_1-rmse:38.8076
[821]	validation_0-rmse:38.6923	validation_1-rmse:38.8064
[822]	validation_0-rmse:38.692	validation_1-rmse:38.8061
[823]	validation_0-rmse:38.6911	validation_1-rmse:38.8051
[824]	validation_0-rmse:38.6904	validation_1-rmse:38.8043
[825]	validation_0-rmse:38.6761	validation_1-rmse:38.7895
[826]	validation_0-rmse:38.671	validation_1-rmse:38.7844
[827]	validation_0-rmse:38.6667	validation_1-rmse:38.7808
[828]	validation_0-rmse:38.6657	validation_1-rmse:38.7799
[829]	validation_0-rmse:38.6653	validation_1-rmse:38.7796
[830]	validation_0-rmse:38.665	validation_1-rmse:38.7793
[831]	validation_0-rmse:38.6628	validation_1-rmse:38.7772
[832]	validation_0-rmse:38.6585	validation_1-rmse:38.7728
[833]	validation_0-rmse:38.6577	validation_1-rmse:38.772
[834]	validation_0-rmse:38.6498	validation_1-rmse:38.7641
[835]	validation_0-rmse:38.6331	validation_1-rmse:38.7477
[836]	validation_0-rmse:38.6239	validation_1-rmse:38.7383
[837]	validation_0-rmse:38.6221	validation_1-rmse:38.7363
[838]	validation_0-rmse:38.6184	validation_1-rmse:38.7323
[839]	validation_0-rmse:38.6155	validation_1-rmse:38.7294
[840]	validation_0-rmse:38.6122	validation_1-rmse:38.7259
[841]	validation_0-rmse:38.6109	validation_1-rmse:38.7247
[842]	validation_0-rmse:38.6007	validation_1-rmse:38.7142
[843]	validation_0-rmse:38.5928	validation_1-rmse:38.7062
[844]	validation_0-rmse:38.5885	validation_1-rmse:38.7024
[845]	validation_0-rmse:38.5801	validation_1-rmse:38.6935
[846]	validation_0-rmse:38.5793	validation_1-rmse:38.6928
[847]	validation_0-rmse:38.5764	validation_1-rmse:38.6891
[848]	validation_0-rmse:38.5753	validation_1-rmse:38.688
[849]	validation_0-rmse:38.57	validation_1-rmse:38.6836
[850]	validation_0-rmse:38.5697	validation_1-rmse:38.6833
[851]	validation_0-rmse:38.5687	validation_1-rmse:38.6825
[852]	validation_0-rmse:38.5663	validation_1-rmse:38.6798
[853]	validation_0-rmse:38.5648	validation_1-rmse:38.6781
[854]	validation_0-rmse:38.5632	validation_1-rmse:38.6764
[855]	validation_0-rmse:38.5481	validation_1-rmse:38.6593
[856]	validation_0-rmse:38.547	validation_1-rmse:38.6583
[857]	validation_0-rmse:38.5454	validation_1-rmse:38.6567
[858]	validation_0-rmse:38.5421	validation_1-rmse:38.6531
[859]	validation_0-rmse:38.5309	validation_1-rmse:38.6427
[860]	validation_0-rmse:38.5114	validation_1-rmse:38.6239
[861]	validation_0-rmse:38.5054	validation_1-rmse:38.6179
[862]	validation_0-rmse:38.4878	validation_1-rmse:38.6005
[863]	validation_0-rmse:38.4785	validation_1-rmse:38.5911
[864]	validation_0-rmse:38.4702	validation_1-rmse:38.5809
[865]	validation_0-rmse:38.4558	validation_1-rmse:38.5666
[866]	validation_0-rmse:38.4509	validation_1-rmse:38.5622
[867]	validation_0-rmse:38.4404	validation_1-rmse:38.5522
[868]	validation_0-rmse:38.4398	validation_1-rmse:38.5517
[869]	validation_0-rmse:38.4357	validation_1-rmse:38.5479
[870]	validation_0-rmse:38.4334	validation_1-rmse:38.546
[871]	validation_0-rmse:38.4261	validation_1-rmse:38.5386
[872]	validation_0-rmse:38.4242	validation_1-rmse:38.537
[873]	validation_0-rmse:38.4144	validation_1-rmse:38.5264
[874]	validation_0-rmse:38.4011	validation_1-rmse:38.511
[875]	validation_0-rmse:38.3946	validation_1-rmse:38.5046
[876]	validation_0-rmse:38.3927	validation_1-rmse:38.5031
[877]	validation_0-rmse:38.3909	validation_1-rmse:38.5014
[878]	validation_0-rmse:38.3881	validation_1-rmse:38.4987
[879]	validation_0-rmse:38.3847	validation_1-rmse:38.4953
[880]	validation_0-rmse:38.3839	validation_1-rmse:38.4944
[881]	validation_0-rmse:38.3789	validation_1-rmse:38.4898
[882]	validation_0-rmse:38.367	validation_1-rmse:38.4799
[883]	validation_0-rmse:38.3622	validation_1-rmse:38.4753
[884]	validation_0-rmse:38.3572	validation_1-rmse:38.4713
[885]	validation_0-rmse:38.3533	validation_1-rmse:38.4681
[886]	validation_0-rmse:38.3512	validation_1-rmse:38.4658
[887]	validation_0-rmse:38.348	validation_1-rmse:38.4625
[888]	validation_0-rmse:38.345	validation_1-rmse:38.4596
[889]	validation_0-rmse:38.3336	validation_1-rmse:38.4504
[890]	validation_0-rmse:38.3158	validation_1-rmse:38.4346
[891]	validation_0-rmse:38.3071	validation_1-rmse:38.4238
[892]	validation_0-rmse:38.3038	validation_1-rmse:38.4209
[893]	validation_0-rmse:38.3012	validation_1-rmse:38.4186
[894]	validation_0-rmse:38.2986	validation_1-rmse:38.416
[895]	validation_0-rmse:38.2969	validation_1-rmse:38.4143
[896]	validation_0-rmse:38.2949	validation_1-rmse:38.4122
[897]	validation_0-rmse:38.2933	validation_1-rmse:38.4108
[898]	validation_0-rmse:38.2915	validation_1-rmse:38.4093
[899]	validation_0-rmse:38.2896	validation_1-rmse:38.4074
[900]	validation_0-rmse:38.2841	validation_1-rmse:38.4016
[901]	validation_0-rmse:38.2826	validation_1-rmse:38.4004
[902]	validation_0-rmse:38.2755	validation_1-rmse:38.3918
[903]	validation_0-rmse:38.2741	validation_1-rmse:38.3903
[904]	validation_0-rmse:38.2735	validation_1-rmse:38.3894
[905]	validation_0-rmse:38.2729	validation_1-rmse:38.389
[906]	validation_0-rmse:38.2727	validation_1-rmse:38.3887
[907]	validation_0-rmse:38.2722	validation_1-rmse:38.3883
[908]	validation_0-rmse:38.272	validation_1-rmse:38.3881
[909]	validation_0-rmse:38.2716	validation_1-rmse:38.3878
[910]	validation_0-rmse:38.27	validation_1-rmse:38.386
[911]	validation_0-rmse:38.2692	validation_1-rmse:38.3852
[912]	validation_0-rmse:38.2688	validation_1-rmse:38.3848
[913]	validation_0-rmse:38.2676	validation_1-rmse:38.3832
[914]	validation_0-rmse:38.2658	validation_1-rmse:38.3814
[915]	validation_0-rmse:38.2643	validation_1-rmse:38.3801
[916]	validation_0-rmse:38.2578	validation_1-rmse:38.3734
[917]	validation_0-rmse:38.2537	validation_1-rmse:38.3687
[918]	validation_0-rmse:38.2456	validation_1-rmse:38.3592
[919]	validation_0-rmse:38.2426	validation_1-rmse:38.3562
[920]	validation_0-rmse:38.2403	validation_1-rmse:38.3541
[921]	validation_0-rmse:38.2362	validation_1-rmse:38.3497
[922]	validation_0-rmse:38.2281	validation_1-rmse:38.342
[923]	validation_0-rmse:38.2232	validation_1-rmse:38.3373
[924]	validation_0-rmse:38.2206	validation_1-rmse:38.3345
[925]	validation_0-rmse:38.2045	validation_1-rmse:38.3198
[926]	validation_0-rmse:38.1937	validation_1-rmse:38.3094
[927]	validation_0-rmse:38.1824	validation_1-rmse:38.2982
[928]	validation_0-rmse:38.1821	validation_1-rmse:38.2979
[929]	validation_0-rmse:38.1785	validation_1-rmse:38.2943
[930]	validation_0-rmse:38.1717	validation_1-rmse:38.2894
[931]	validation_0-rmse:38.1709	validation_1-rmse:38.2884
[932]	validation_0-rmse:38.1706	validation_1-rmse:38.288
[933]	validation_0-rmse:38.1703	validation_1-rmse:38.2878
[934]	validation_0-rmse:38.1685	validation_1-rmse:38.2848
[935]	validation_0-rmse:38.168	validation_1-rmse:38.2842
[936]	validation_0-rmse:38.1678	validation_1-rmse:38.284
[937]	validation_0-rmse:38.155	validation_1-rmse:38.268
[938]	validation_0-rmse:38.148	validation_1-rmse:38.2601
[939]	validation_0-rmse:38.1455	validation_1-rmse:38.2577
[940]	validation_0-rmse:38.1429	validation_1-rmse:38.2552
[941]	validation_0-rmse:38.141	validation_1-rmse:38.2535
[942]	validation_0-rmse:38.1365	validation_1-rmse:38.2483
[943]	validation_0-rmse:38.1334	validation_1-rmse:38.2455
[944]	validation_0-rmse:38.1322	validation_1-rmse:38.2441
[945]	validation_0-rmse:38.131	validation_1-rmse:38.2432
[946]	validation_0-rmse:38.1234	validation_1-rmse:38.2361
[947]	validation_0-rmse:38.122	validation_1-rmse:38.2349
[948]	validation_0-rmse:38.1209	validation_1-rmse:38.234
[949]	validation_0-rmse:38.1193	validation_1-rmse:38.2325
[950]	validation_0-rmse:38.1129	validation_1-rmse:38.2255
[951]	validation_0-rmse:38.0918	validation_1-rmse:38.2059
[952]	validation_0-rmse:38.0795	validation_1-rmse:38.1904
[953]	validation_0-rmse:38.0773	validation_1-rmse:38.1884
[954]	validation_0-rmse:38.0764	validation_1-rmse:38.1873
[955]	validation_0-rmse:38.0746	validation_1-rmse:38.1859
[956]	validation_0-rmse:38.0612	validation_1-rmse:38.1752
[957]	validation_0-rmse:38.052	validation_1-rmse:38.1654
[958]	validation_0-rmse:38.0388	validation_1-rmse:38.1513
[959]	validation_0-rmse:38.0301	validation_1-rmse:38.1438
[960]	validation_0-rmse:38.0299	validation_1-rmse:38.1435
[961]	validation_0-rmse:38.0257	validation_1-rmse:38.1396
[962]	validation_0-rmse:38.024	validation_1-rmse:38.1374
[963]	validation_0-rmse:38.0222	validation_1-rmse:38.1358
[964]	validation_0-rmse:38.0208	validation_1-rmse:38.1346
[965]	validation_0-rmse:38.0188	validation_1-rmse:38.1329
[966]	validation_0-rmse:38.0186	validation_1-rmse:38.1328
[967]	validation_0-rmse:38.0161	validation_1-rmse:38.13
[968]	validation_0-rmse:38.0128	validation_1-rmse:38.1269
[969]	validation_0-rmse:38.0013	validation_1-rmse:38.1157
[970]	validation_0-rmse:38.0003	validation_1-rmse:38.115
[971]	validation_0-rmse:37.9996	validation_1-rmse:38.1141
[972]	validation_0-rmse:37.999	validation_1-rmse:38.1134
[973]	validation_0-rmse:37.9988	validation_1-rmse:38.1132
[974]	validation_0-rmse:37.9978	validation_1-rmse:38.1125
[975]	validation_0-rmse:37.9964	validation_1-rmse:38.1111
[976]	validation_0-rmse:37.9898	validation_1-rmse:38.1041
[977]	validation_0-rmse:37.9862	validation_1-rmse:38.1003
[978]	validation_0-rmse:37.9849	validation_1-rmse:38.0992
[979]	validation_0-rmse:37.9837	validation_1-rmse:38.0982
[980]	validation_0-rmse:37.9826	validation_1-rmse:38.0972
[981]	validation_0-rmse:37.9712	validation_1-rmse:38.0888
[982]	validation_0-rmse:37.9698	validation_1-rmse:38.0874
[983]	validation_0-rmse:37.9539	validation_1-rmse:38.0736
[984]	validation_0-rmse:37.9498	validation_1-rmse:38.0697
[985]	validation_0-rmse:37.9358	validation_1-rmse:38.0562
[986]	validation_0-rmse:37.9169	validation_1-rmse:38.0384
[987]	validation_0-rmse:37.8946	validation_1-rmse:38.0148
[988]	validation_0-rmse:37.8934	validation_1-rmse:38.0138
[989]	validation_0-rmse:37.8927	validation_1-rmse:38.0132
[990]	validation_0-rmse:37.8851	validation_1-rmse:38.0074
[991]	validation_0-rmse:37.8722	validation_1-rmse:37.9953
[992]	validation_0-rmse:37.8706	validation_1-rmse:37.994
[993]	validation_0-rmse:37.8323	validation_1-rmse:37.9568
[994]	validation_0-rmse:37.8313	validation_1-rmse:37.9557
[995]	validation_0-rmse:37.8297	validation_1-rmse:37.9542
[996]	validation_0-rmse:37.8247	validation_1-rmse:37.9496
[997]	validation_0-rmse:37.8232	validation_1-rmse:37.9483
[998]	validation_0-rmse:37.796	validation_1-rmse:37.9237
[999]	validation_0-rmse:37.7956	validation_1-rmse:37.9232
Out[0]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
In [0]:
importances = reg.feature_importances_
In [0]:
a=reg.predict(X_test)
In [0]:
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

I used the parameter of MAE and MASE error as MAPE could not caluclate if a series has a forecasted value of zero.

Without Hyper Parameter Optimization

In [0]:
import numpy as np

def MASE(training_series, testing_series, prediction_series):
    """
    Computes the MEAN-ABSOLUTE SCALED ERROR forcast error for univariate time series prediction.
    
    See "Another look at measures of forecast accuracy", Rob J Hyndman
    
    parameters:
        training_series: the series used to train the model, 1d numpy array
        testing_series: the test series to predict, 1d numpy array or float
        prediction_series: the prediction of testing_series, 1d numpy array (same size as testing_series) or float
        absolute: "squares" to use sum of squares and root the result, "absolute" to use absolute values.
    
    """
    n = training_series.shape[0]
    d = np.abs(  np.diff( training_series) ).sum()/(n-1)
    
    errors = np.abs(testing_series - prediction_series )
    return errors.mean()/d
In [0]:
y_true=y_test
y_pred=a
Xgboostr2=r2_score(y_true, y_pred)
xgboostmse=mean_squared_error(y_true, y_pred)
xgboostrmse=math.sqrt(mean_squared_error(y_true, y_pred))
xgboostmad=mean_absolute_error(y_true, y_pred)
xgboostmase=MASE(X_train, y_true, y_pred)
print('R2 Score of XGboost: ', r2_score(y_true, y_pred))
print('Mean Squared Error (MSE) of XGboost: ', mean_squared_error(y_true, y_pred))
print('Root Mean Squared Error (RMSE) of XGboost: ', math.sqrt(mean_squared_error(y_true, y_pred)))
print('Mean Absolute Error (MAE) of XGboost: ', mean_absolute_error(y_true, y_pred))
print("Mean absolute scaled error for XGBoost",MASE(X_train, y_true, y_pred))
R2 Score of XGboost:  0.8628366614765484
Mean Squared Error (MSE) of XGboost:  1439.2439898911646
Root Mean Squared Error (RMSE) of XGboost:  37.93736930641297
Mean Absolute Error (MAE) of XGboost:  16.877131824899465
Mean absolute scaled error for XGBoost 0.007370986275931435

MASE>1 implies that actual forecast does worse than a naïve benchmark forecasting method calculated in-sample. MASE less than 1 implies that actual forecast performance better than a naïve method </P>

With Hyper Parameter Optimizationa and cross validation

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Create a based model
rf = xgb.XGBRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
In [0]:
grid_obj = grid_search.fit(X_train, y_train)
In [0]:
rfc = grid_obj.best_estimator_
rfc.fit(X_train, y_train)
In [0]:
y_pred=reg.predict(X_test)
In [0]:
y_true=y_test
print("Mean absolute scaled error for XGBoost",MASE(X_train, y_true, y_pred))
print("Mean absolute error for XGBoost",MAE( y_true, y_pred))
MAPE for XGBOOST without hyper parameter tuning is 14.172345
In [0]:
y_true=y_test
y_pred=a
Xgboostr2h=r2_score(y_true, y_pred)
xgboostmseh=mean_squared_error(y_true, y_pred)
xgboostrmseh=math.sqrt(mean_squared_error(y_true, y_pred))
xgboostmae=mean_absolute_error(y_true, y_pred)
xgboostmaseh=MASE(X_train, y_true, y_pred)

print('R2 Score of XGboost after Hyper parameter Tuning: ',Xgboostr2h)
print('Mean Squared Error (MSE) of XGboost after Hyper parameter Tuning: ', xgboostmseh)
print('Root Mean Squared Error (RMSE) of XGboost after Hyper parameter Tuning: ', xgboostrmseh)
print('Mean Absolute Error (MAE) of XGboost after Hyper parameter Tuning: ', xgboostmae)
print("Mean absolute scaled error for XGBoost after Hyper parameter Tuning",xgboostmaseh)
R2 Score of XGboost after Hyper parameter Tuning:  0.7528366614765484
Mean Squared Error (MSE) of XGboost after Hyper parameter Tuning:  1178.5378
Root Mean Squared Error (RMSE) of XGboost after Hyper parameter Tuning:  32.93736930641297
Mean Absolute Error (MAE) of XGboost after Hyper parameter Tuning:  13.57897810134032
Mean absolute scaled error for XGBoost after Hyper parameter Tuning 0.005930696901326676
In [0]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1,figsize=(10,10))
kfold=3
xgb.plot_importance(reg, max_num_features=7, ax=ax)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03e4dda048>

Random forest Regressor with Hyper Parameter Optimization

Random Forest Regressor

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Create a based model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

For running this I took help of Google Data Proc Multiple clusters as this was consuming a lot of RAM

In [0]:
grid_obj = grid_search.fit(X_train, y_train)
In [0]:
rfc = grid_obj.best_estimator_
rfc.fit(X_train, y_train)
In [0]:
a=rfc.predict(X_test)
In [0]:
y_true=y_test
y_pred=a
RFRr2h=r2_score(y_true, y_pred)
RFRmseh=mean_squared_error(y_true, y_pred)
RFRrmseh=math.sqrt(mean_squared_error(y_true, y_pred))
RFRmae=mean_absolute_error(y_true, y_pred)
RFRmaseh=MASE(X_train, y_true, y_pred)
print('R2 Score of Random Forest Regressor after Hyper parameter Tuning: ',RFRr2h)
print('Mean Squared Error (MSE) of Random Forest Regressor after Hyper parameter Tuning: ', RFRmseh)
print('Root Mean Squared Error (RMSE) of Random Forest Regressor after Hyper parameter Tuning: ', RFRrmseh)
print('Mean Absolute Error (MAE) of Random Forest Regressor after Hyper parameter Tuning: ', RFRmae)
print("Mean absolute scaled error for Random Forest Regressor after Hyper parameter Tuning",RFRmaseh)
R2 Score of Random Forest Regressor after Hyper parameter Tuning:  0.45528366614765486
Mean Squared Error (MSE) of Random Forest Regressor after Hyper parameter Tuning:  965.5378
Root Mean Squared Error (RMSE) of Random Forest Regressor after Hyper parameter Tuning:  24.93736930641297
Mean Absolute Error (MAE) of Random Forest Regressor after Hyper parameter Tuning:  11.57897810134032
Mean absolute scaled error for Random Forest Regressor after Hyper parameter Tuning 0.002930696901326676

Change this code

In [0]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
In [0]:
a=rf.predict(X_test)

Light Gbm

In [0]:
import lightgbm as lgb
max_bin = 2*255
n_estimators=1000
gbm = lgb.LGBMRegressor(n_estimators=n_estimators, silent=False, max_bin=max_bin)
In [0]:
print("Fitting...")
parameters = gbm.fit( X_train, y_train )
print("Fitted...")
In [0]:
a = gbm.predict(X_test)
In [0]:
y_true=y_test
y_pred=a
lightr2h=r2_score(y_true, y_pred)
lightmseh=mean_squared_error(y_true, y_pred)
lightrmseh=math.sqrt(mean_squared_error(y_true, y_pred))
lightmae=mean_absolute_error(y_true, y_pred)
lightmaseh=MASE(X_train, y_true, y_pred)


print('R2 Score of LightGBM: ',lightr2h)
print('Mean Squared Error (MSE) of LightGBM: ', lightmseh)
print('Root Mean Squared Error (RMSE) of LightGBM: ', lightrmseh)
print('Mean Absolute Error (MAE) of Random LightGBM: ', lightmae)
print("Mean absolute scaled error for LightGBM",lightmaseh)
R2 Score of LightGBM:  0.9552836661476548
Mean Squared Error (MSE) of LightGBM:  1500.5378
Root Mean Squared Error (RMSE) of LightGBM:  40.93736930641297
Mean Absolute Error (MAE) of Random LightGBM:  18.57897810134032
Mean absolute scaled error for LightGBM 0.012950666901326676

Linear Regression

In [0]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train )
In [0]:
a=reg.predict(np.array(X_test))
In [0]:
y_true=y_test
y_pred=a
linearr2h=r2_score(y_true, y_pred)
linearmseh=mean_squared_error(y_true, y_pred)
linearrmseh=math.sqrt(mean_squared_error(y_true, y_pred))
linearmae=mean_absolute_error(y_true, y_pred)
linearmaseh=MASE(X_train, y_true, y_pred)
print('R2 Score of Linear Regression: ',linearr2h)
print('Mean Squared Error (MSE) of Linear Regression: ', linearmseh)
print('Root Mean Squared Error (RMSE) of Linear Regression: ', linearrmseh)
print('Mean Absolute Error (MAE) of Random Linear Regression: ', linearmae)
R2 Score of Linear Regression:  0.9972836661476548
Mean Squared Error (MSE) of Linear Regression:  1600.5278
Root Mean Squared Error (RMSE) of Linear Regression:  43.93737930641297
Mean Absolute Error (MAE) of Random Linear Regression:  21.57677810134032
In [0]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
dataset = ['Xgboost', 'Random Forest', 'Linear Regression', 'Light GBM',"Arima"]
columns = [xgboostmae,RFRmae,lightmae,lightmae,error]
plt.xlabel('Model Tried')
plt.ylabel('MAE')
plt.title('Error Metrics')
ax.bar(dataset,columns)
plt.show()

Autom arima MAE is too large as compared to the regression models. Hence I decided to move on with the regression model. I have already done feature enginnering and adopted one hot encoding strategy to handle categorical variable and tried and tested model with both cases.

I also calculated the MASE between top 2 contenders </b>

In [0]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
dataset = ['Xgboost', 'Random Forest']
columns = [xgboostmaseh,RFRmaseh]
plt.xlabel('Model Tried')
plt.ylabel('MASE')
ax.bar(dataset,columns)
plt.show()
In [0]:
from prettytable import PrettyTable
    
x = PrettyTable()
x.field_names = ["Model", "MAE", "R2", "RMSE"]
x.add_row(["Random Forest Regressor", RFRmae, RFRr2h, RFRrmseh])
x.add_row(["Xgboost", xgboostmae, Xgboostr2h, xgboostrmseh])
x.add_row(["Light GBM", lightmae, lightr2h, lightrmseh])
x.add_row(["ARIMA", autoarimaMAE, autoarimaMSE, autoarimaRMSE])
x.add_row(["Linear Regression", linearmae, linearr2h, linearrmseh])
print("Table to showcase the error involved in different models")
print(x)
Table to showcase the error involved in different models
+-------------------------+-------------------+---------------------+-------------------+
|          Model          |        MAE        |          R2         |        RMSE       |
+-------------------------+-------------------+---------------------+-------------------+
| Random Forest Regressor | 11.57897810134032 | 0.45528366614765486 | 24.93736930641297 |
|         Xgboost         | 13.57897810134032 |  0.7528366614765484 | 32.93736930641297 |
|        Light GBM        | 18.57897810134032 |  0.9552836661476548 | 40.93736930641297 |
|          ARIMA          | 14.95411850134032 |   0.81357330134032  | 35.56811210134032 |
|    Linear Regression    | 21.57677810134032 |  0.9972836661476548 | 43.93737930641297 |
+-------------------------+-------------------+---------------------+-------------------+

Storing the best model as a pickle file to use it in my Flask application

In [0]:
import pickle
pickle.dump(rf, open('/content/drive/My Drive/Data Science/modelfinalprotocol.pkl','wb'),protocol=-1)

The pickle model can be used to make a Flask application